EVALUATION


The Report constitutes a detailed state-of-the-art assessment
of the QUINCE System for Chinese-English machine translation
of S&T literature. The current version of the System consists
of nine major modules controlling the translation process and
a repertory of utilities controlling I/O operations, format
conversion, debugging and optimisation. The System's
programming documentation is provided separately in Vols 1 - 9
(application software) and 10 - 13 (utilities). The System
includes a variety of data bases, an extensive software package
for implementation of the translation process, and a set of
linguistic devices implicitly incorporated in linguistic and
programming components to optimize the analysis of Chinese
and English. A detailed description of the System is provided
in Section II.


Judging from the viewpoint of scholastic merit, Sections III
and IV constitute the most significant contributions to this
Report. Section III contains an exceptionally comprehensive
and thoroughly researched critique-in-depth of the current
state of the art in computational syntactic description of
natural languages, including a statement of its implications
for a further development of the QUINCE System. It is
concluded that grammars with structured vocabulary play an
important role in all current language processing systems,
including the QUINCE System, in which this concept is
elaborated in a more systematic manner than in other language
processing systems. It also appears that a grammar notation
based on Knuth's attribute grammars offers the most promising
vista for a further development of the System. Section IV
provides an exhaustive discussion of possibilities for a
further consolidation of the linguistic data base in terms of
the featurized lexicon and interlingual transfer rules. A
continuing enhancement of the diversified feature subsystem
and contrastive lexical/syntactic studies of Chinese and
English, combined with contextual analysis of language-
specific characteristics of Chinese, are offered as the most
promising solutions in this area.


ZBIGNIEW L. PANKOWICZ
Technical  Evaluator 
SUMMARY

This report presents the results of a nine-month effort to
document the Berkeley Chinese-English machine translation system (Quince
system), to take inventory of all research materials, and to report on the
current state of the art in linguistic theory, computational linguistics, and
data processing techniques for advancement of the Quince system to the status
of an initial operational capability in one sub-discipline of physics.

A detailed textual description of the Quince system modules plus a body
of figures and tables are provided to assist the reader in conceptualizing
the system and reading the program code listings appended in the Supplements
to this report. An itemized inventory of both the hardware and software of
the translation system is presented. We review the current state of the art
in new syntactic descriptive methods with structured vocabulary, such as van
Wijngaarden grammars, Koster's affix-grammars, and Knuth's attribute-grammars,
which were developed for defining programming languages, but which are
suitable for computational use in machine translation systems for natural
languages. The existing linguistic data base of the system is reviewed in
the light of current linguistic theory and of recent advances in artificial
intelligence and computational linguistics. Suggestions for consolidating
the linguistic data base and enhancing the parsing facility are made to
advance the system to an initial operational capability.
I. INTRODUCTION

The University of California Chinese-English machine translation
research project was initiated in 1960 under a National Science Foundation
grant. From 1967 to 1975 the Project was supported by the Department of
the Air Force (Rome Air Development Center) under five contracts. These
efforts have culminated in the development of the Syntactic Analysis
System and the emergence of the Quince system for translation from Chinese
to English.

However, owing to the abrupt termination of the previous contract
in 1975 (F 30602-75-C-0059), the Quince system programs were left incomplete
and their documentation has never had the chance of being adequately carried
out. The present contract (F 30602-77-C-0093), covering a period of
nine months, May 1977 to January 1978, provides the Project with an
opportunity to fully document the Quince system programs as currently
implemented and to reassess the whole translation system in the light of
recent advances in linguistic theory, computational linguistics, artificial
intelligence, and computer science. The documentation and inventory of
the system and its reassessment will provide a smooth transition for
any ensuing effort in Chinese-English machine translation research.

The  documentation  and  inventory  of  the  system  and  its  reassessment  are 
the  two  major  sections  of  this  report.  Chapter  Two  will  be  devoted  to 
the  first  task,  and  Chapters  Three  and  Four  the  second. 

The Quince system, conceived as an integrated Chinese-English
machine translation system, consists of two major components: the
linguistic data base and system programs. Chapter Two presents a textual
description of the Quince system software, keyed to the Supplements. It
concentrates on those aspects least amenable to mechanical documentation:
overviews of the system as a whole, the interface between major modules,
and internal data structures. This chapter also includes a body of
figures and tables to assist the reader in conceptualizing the translation
system and reading the Supplements.

There are thirteen Supplements altogether. Supplements One through
Nine detail the contents of nine major modules of Quince system code.
These are mechanically-produced documentations that explicitly extract
the storage areas and the calling sequences of each sub-program. Each
supplement contains three volumes: the Source Code Listing, and the
Coding Internals Manual volumes I and II.

The remaining four supplements are mechanically-produced documentations
of the utilities programs, and three types of storage documentation
(loader storage, Common Block, and field function).

An itemized inventory of the hardware belonging to the U.S. government
and the software related to the translation system is presented in
the Appendix to Chapter Two.

Chapter Three reports on the state of the art in computational
syntactic description of natural languages. It reviews context-free
grammars and points out their inadequacies for handling natural languages
and their clumsiness for human consumption. Van Wijngaarden's model
for context-free grammars with structured vocabulary is presented to
remedy the inadequacies of and avoid the clumsiness inherent in context-free
grammars. The history of the use of structured vocabulary in linguistic
descriptions is traced to some pre-Chomskyan structural linguists.

Restrictions on the generative capacity of van Wijngaarden grammars
are discussed to arrive at a class of grammars easier to write and easier
to parse. Related formalisms, i.e. Koster's affix-grammars and Knuth's
attribute-grammars, are also explored and compared with other types of
van Wijngaarden grammars. Finally, it is pointed out that the use of
structured vocabulary in describing the Chinese grammar has been a topic
of research at the Project since at least 1970. Future tasks and research
areas for the Project are defined and strategies suggested.

In the last chapter, two components of the linguistic data base,
the dictionary and interlingual transfer rules, are examined. To ensure
better interactions between the two major sub-components of the grammar,
the syntactic rules and lexicon, more lexical features are needed in the
future grammar. Various kinds of lexical features, their nature
and functions, and procedures to extract the lexical information from the
existing grammar codes are discussed.

The  status  of  the  interlingual  transfer  rules  in  the  translation 
cycle  of  the  system  is  examined.  Different  types  of  the  interlingual 
transfer  rules  and  the  formalisms  to  be  used  in  the  future  are  also 
discussed. 

In  addition,  more  contrastive  lexical  and  syntactic  studies  between 
Chinese  and  English  and  contextual  analysis  are  recommended  in  the  future 
to  strengthen  the  two  components  above  of  the  linguistic  data  base. 
Strategies  to  achieve  this  goal  are  also  briefly  described.  Areas 
where  those  studies  will  lead  to  the  improvement  of  the  linguistic  data  base 
are  exemplified. 


II.  DESCRIPTION  AND  DOCUMENTATION  OF  THE 
QUINCE  MACHINE-TRANSLATION  SYSTEM 


1. Introduction

The Quince system is an integrated system for the machine translation
of scientific texts from Chinese to English, developed at the Project on
Linguistic Analysis, UC Berkeley. It includes several components: a large
corpus of linguistic materials, such as texts, dictionary, and grammar; an
extensive body of software to implement the translation; and a set of
linguistic insights about how Chinese and English should best be analysed,
which are implicitly incorporated within both the linguistic and programming
materials.

The second component of the Quince system, the body of computer code
written to perform the translation, is documented in this chapter and in the
13 Supplements appended to this report. It cannot, of course, be totally
separated from its data base or theoretical approach. This component,
however, has been under-documented in previous reports, and so it is
presented here in isolation.

Supplements 1-9 detail the contents of 9 major modules of Quince
system code. Each Supplement contains 3 volumes: the Source Code Listing,
and the Coding Internals Manual in 2 volumes. These are mechanically-
produced documentations that explicitly extract the storage areas and the
calling sequences of each subprogram. These nine modules include the six
'main' modules, two 'supplemental' modules, and one module (the Parse Table
Print Module) used for program debugging and linguistic research. Each
module is written in Fortran.

The remaining four Supplements are mechanically-produced documentations
of the utilities programs (Fortran as well as assembler), and three types
of storage documentation (loader storage, Common Block, and field function).

This chapter presents a textual description of the Quince system
software, keyed to the Supplements. It concentrates on those aspects least
amenable to mechanical documentation: overviews of the system as a whole,
the interface between major modules, and internal data structures. This
chapter also includes a body of appended tables to assist in reading the
Supplements.

To a large extent, the Quince system can be 'understood' by the flow and
structures of the data at different stages of processing. The four external
data files are the ultimate bases of the translation; of these, it can
be said that the text is passed from module to module, sometimes in its
entirety, sometimes segment-by-segment. This necessitates temporary
interface files. Likewise, the grammar and its external/internal code
conversion table reside in binary files. Within each module, segments of
text are manipulated using data structures which are field function tables;
these in turn reside in Common Blocks, which permit communication within and
between modules. This chapter describes these data and storage details, as a
documentation-by-effect of the Quince system.

The Quince system documented here is Version .8; this is to indicate
that it is substantially complete, ready for documentation, but not finished.
The translation from Chinese to English would be greatly improved if some of
the data bases were enhanced, in particular by the addition of feature
information to the dictionary and grammar; this would also be utilised by the
transfer rules (see Chapter 4). The Quince system has prepared for these
proposed changes in the data base, but until they are made available the
system cannot be considered complete. Even so, the Quince system in its
current form is a major research result in the field of machine translation.


2.  External  Data  Bases 


The Quince system has available four external data bases: a Chinese-
English dictionary, a set of grammar rules for parsing Chinese, a set of
Chinese telecode substitutions, and a raw text to be translated. (See
Figure 1.)
Figure 1. Quince Modules and External Data
Bases. Overview of 4 external data bases, 2
additional data bases, 6 primary modules, 1
additional module, and 2 supplementary modules
in the Quince system.
2.1 Dictionary

The Chinese-English dictionary exists in two bodies: the master CHIDIC
(80,000 entries), and the smaller PHYDIC (40,000 entries). PHYDIC contains
most general-purpose Chinese words, as well as technical terms in the fields
of physics and mathematics. It is a subset of CHIDIC, which in practice is
too cumbersome to maintain during research. Both dictionaries are in the
same format. Each dictionary entry contains information on the grammatical
category (terminal symbol) of the item in Chinese and its translation in
English, keyed to telegraphic codes.

2.2 Grammar

The grammar is a set of context-free production rules, or source rules,
which define the surface structure of Chinese; it is necessary to fully parse
each segment of Chinese text before translation, as the structures of Chinese
and English are so different. The grammar actually consists of 5 subgrammars,
each of which handles a particular level in a parse-tree. These subgrammars
are usually applied in sequential order. The size of each subgrammar is as
follows:


Grammar 1 --  125 rules
Grammar 2 --  500 rules
Grammar 3 -- 2400 rules
Grammar 4 --  340 rules
Grammar 5 -- 2750 rules

Many  source  rules  are  included  in  more  than  one  subgrammar;  in  particular, 
Grammar  3 and  Grammar  5 largely  overlap. 

2.3  Telecode  Substitution  Table 

For every Chinese character there is a corresponding 4-digit telecode:
this is the coding scheme used in the dictionary and in the text. There is,
however, a set of characters that optionally substitute for one or more
other characters. Accordingly, whenever potential substitution characters
(i.e. telecodes) are encountered, their possible corresponding character(s)
must also be made subject to dictionary lookup. These correspondences are
found in the external telecode substitution table.
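
The effect of the substitution table on lookup can be sketched in a few
lines of Python. This is only an illustration under stated assumptions: the
table contents, the name SUBSTITUTIONS and the function lookup_variants are
hypothetical, and the report does not say how the Fortran code enumerates
the alternatives.

```python
from itertools import product

# Hypothetical substitution table: telecode -> telecodes it may stand in for.
# The codes below are made-up placeholders, not entries from the real table.
SUBSTITUTIONS = {
    "0001": ["0002"],
    "0003": ["0004", "0005"],
}

def lookup_variants(telecodes, table=SUBSTITUTIONS):
    """Return every telecode string that must be tried in dictionary lookup."""
    # Each position offers the original code plus any characters it may replace.
    choices = [[code] + table.get(code, []) for code in telecodes]
    return [tuple(combo) for combo in product(*choices)]
```

A string containing no substitution characters yields exactly one variant,
itself, so the ordinary lookup path is unchanged in that case.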


2.4  Text 

The Chinese text to be translated includes 565,000 characters from
physics and mathematics texts; it is broken down into subtexts for
convenience. The text is not pre-edited: abbreviations are not expanded,
telecode substitutions are not corrected for, different number-systems (e.g.
classical, modern and Western) are left to co-exist, etc. In particular,
the inconsistently-applied Chinese punctuation marks remain as in the
original. The only additions are position-in-volume information for
maintenance and identification purposes.

2.5  Formats 

The four external data bases are maintained in an "external" format
suitable for human maintenance: character strings, mnemonic category symbols,
pronunciation information together with the telecodes, etc. Each file has
its own set of software maintenance routines, peripheral to the Quince
system and not here documented.

Because there have been no random-access facilities on the CDC 6400
at the University of California at Berkeley, these external files are fixed-
field sequential files, stored on magnetic tape. During a previous contract
period it seemed that such hardware facilities might become available;
accordingly a random-access dictionary was designed, and a random dictionary
software module was written to perform both telecode substitution and
dictionary lookup. This module is one of two "supplemental" modules fully
documented in this report. Were the random-access storage to become
available, this random dictionary module would replace the present vestigand
formation and dictionary lookup modules, and the present sequential
dictionary format would become obsolete (see Figure 1).

As presently implemented, two of these external files are converted to
an "internal" format before participating in the translation process: the
table of telecode substitutions becomes an internal hash table in the module
INVEST, and the grammar source rules are adapted into an automatically-
allocated table by the module ADAPTG. The dictionary remains in external
format; it is read through in one pass during the lookup process. The text
first undergoes pre-editing, or canonization, and then is successively
broken down into smaller translation units; each unit is read in in external
format, processed in internal format, and then saved on an intermediary
interface file in external format.

2.6  Additional  Data  Bases 

As originally designed, the Quince system included two more external
data bases: a set of interlingual transfer specifications, and information
on English morphology. These would, respectively, govern the transformation
or transfer of a Chinese parse tree to an English tree, and tidy up the
surface of the final string, by e.g. agreement in number and tense.

At present, these two data bases do not exist. The transfers have
been approximated by reducing them to two tree operations: deletion and
reversal of nodes. These are triggered by the application of specific grammar
rules, which leave "transfer" codes on the node labels of the Chinese tree
during parsing. In anticipation of a separate body of transfer
specifications, the Quince system does not include a transfer module. There is
instead a temporary interface module with the old SAS Syntactic Analysis
System, which performs these limited transfers and also provides plotting
capabilities for the resulting trees on the Calcomp plotter. The trees are
then returned to the Quince system for string extraction. This SAS
compatibility module is the second "supplemental" module documented in this
report.
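
As a rough illustration of the two tree operations described above, the
following Python sketch applies deletion and reversal to a small parse tree.
The node layout, the code names "DEL" and "REV", and the reading that a
reversal reorders a node's children are assumptions made for the example;
the report gives neither the actual transfer codes nor the SAS tree
representation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    transfer: Optional[str] = None  # hypothetical codes: "DEL" or "REV"

def apply_transfers(node):
    """Return a copy of the tree with deletions and reversals applied."""
    # Drop children marked for deletion, transforming the rest recursively.
    kept = [apply_transfers(c) for c in node.children if c.transfer != "DEL"]
    if node.transfer == "REV":
        kept.reverse()  # reverse the order of this node's children
    return Node(node.label, kept)

def leaves(node):
    """Read the terminal labels left to right (a stand-in for string extraction)."""
    if not node.children:
        return [node.label]
    return [w for c in node.children for w in leaves(c)]
```

Because apply_transfers builds a new tree, the Chinese parse tree stays
available for comparison with the transferred tree.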

As presently implemented, each English word assumes its root dictionary
form during string extraction, without morphological adjustment. The
resulting translation is rough but reasonably comprehensible.

Chapter 4 outlines future plans for these two additional data bases.
Once the interlingual transfer rules are available, an additional Quince
module transfer will modify the Chinese parse tree prior to string extraction.
The string extraction module will be expanded to include information from
the morphological rules.

Figure 1 illustrates the relationship between the four external data
bases and the six main Quince modules; it also includes the two supplemental
modules, as well as two future data bases.


3. Internal Data Storage

During translation, the six Quince modules process the text in two
modes: text-by-text (batch) and segment-by-segment. In batch mode, the
entire body of text must pass through one module before entering the next;
this is e.g. the case while preparing for dictionary lookup, since the entire
text is looked up in the dictionary in one pass. In segment-by-segment
processing, a single segment (translation unit) is passed through several
modules. Obviously there is segment-by-segment processing within each of
the six modules; the distinction only becomes useful in discussing the
interaction between modules.

The Quince modules pass information between each other in three
possible forms: interface files, binary files, and module storage tables.
These differ both in size and function, and are determined largely by the
mode of processing.

During batch mode, the text is passed from module to module in the form
of external interface files. These are of variable length (depending on the
size of the text), and are written on tape as sequential character files.
There are 5 interface files; each is written as output by one module, and
rewound for use as input to the next module.

Binary files are used for two large internal bodies of data which
cannot be accommodated in-core. One is the table of category (terminal)
symbols, which relates the external strings coding these symbols to their
internal hash-table codes; this table is used during both dictionary lookup
and grammar adaptation, as both the dictionary and grammar source rules are
external data bases. The second table is that of the adapted grammar itself;
this includes all five subgrammars, of which only one is in use during any
one parse.

Once built within the translation run, these two tables are variously
stored as binary files on tape, or as common files on system storage;
these storage allocations are performed automatically, depending on
availability of space, to eliminate the repeated reconstruction of these
tables and to minimize retrieval time.

The module storage tables are straightforward Common Blocks. They
primarily provide for shared storage among the sub-programs within each of
the 6 modules; but they occasionally are shared by several modules,
especially during segment-by-segment processing.

Figure 2 outlines the 5 interface files during batch processing, and
their relation to the Quince modules. Figure 3 presents the flow of control
in those modules that perform segment-by-segment processing; it includes the
two binary files. Table 1 indicates which module storage tables (Common
Blocks) are used by each of the Quince modules.
Figure 2. Quince Modules and Interface Files.
Overview of 5 interface files, 4 external data
bases, and 6 primary modules in the Quince
system. (Note that the LOOKUP primary module
is divided into its submodules SELECTV and
WINNOW for clarity.)
4. Interface Files


4.1 Canonized Text File

Before  the  raw  text  can  be  translated,  each  of  its  sub-texts  must  be 
pre-edited  into  smaller  units,  called  sentences;  these  correspond  roughly  to 
English  "sentences".  Because  of  differences  between  English  and  Chinese 
punctuation  practices,  each  Chinese  "sentence"  typically  contains  several  of 
these  smaller  sentence  units.  In  addition,  special  telecodes  and  characters 
such  as  textual  identifiers,  parentheses,  footnotes,  etc.  must  be  analysed 
and  related  to  the  sentence.  All  these  processes  use  an  internal  table  of 
special  telecodes  in  the  Quince  module  CANONZ;  this  module  outputs  a 
canonized  text  file,  in  which  the  string  of  telecodes  is  broken  up  into 
sentences  within  the  text. 

4.2 Segmented Text File

After the preliminary editing during canonization, each sentence is
divided still further into segments; in practice, these segments will
correspond to parse-units during parsing. In the dictionary each lexical entry
may consist of 1-7 telecodes (including e.g. idioms or compounds); one
characteristic of a segment is that a lexical entry will not extend past the
segment boundaries. These boundaries include punctuation marks (period,
parentheses, commas), as well as special syntax-marking Chinese characters.

The module INVEST outputs a segmented text file; this is a reformulation
of the canonized text file in terms of segments rather than sentences.
This file does not participate in the dictionary lookup or parsing of the
segments: it is kept around in text order until STREXT, for research purposes,
so that the English and Chinese strings may be manually compared.

4.3 Vestigand File

As each segment is determined, it is necessary to make a list of all
its lexical items; these are subsequently subject to dictionary lookup.
Determining word boundaries in Chinese is, however, not a trivial task, as
each dictionary entry consists of a variable number of telecodes. Thus
from each segment are calculated all the vestigands: these are strings of
from 1 to 7 telecodes in length, any of which might be a valid lexical item.



Thus the module INVEST also outputs an interface vestigand file.
This is the segmented text file formulated in vestigands (rather than in
telecodes) for each segment.
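
Vestigand formation as described above amounts to enumerating every
substring of 1 to 7 telecodes within a segment. The Python sketch below
shows that enumeration; the function name and the (start, telecodes) record
shape are assumptions for illustration, not the INVEST record format.

```python
MAX_LEN = 7  # a lexical entry spans 1 to 7 telecodes, per the text

def vestigands(segment, max_len=MAX_LEN):
    """List every candidate (start, telecodes) of 1..max_len codes in a segment."""
    out = []
    for start in range(len(segment)):
        # Candidates never extend past the segment boundary.
        for length in range(1, min(max_len, len(segment) - start) + 1):
            out.append((start, tuple(segment[start:start + length])))
    return out
```

A segment of n telecodes therefore yields at most 7n candidates, most of
which are expected to be rejected during dictionary lookup.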

4.4  Selected  Dictionary  File 

The vestigand file is first sorted into dictionary order (using the
file/sort utility); and then the sub-module SELECTV (first half of the LOOKUP
process) searches for each vestigand in the external dictionary. For each
vestigand found (now a lexical item), it records all the dictionary
information except the romanization (pronunciation), in addition to the
information from the inputted vestigand file, onto the selected dictionary
file. The selected dictionary file thus includes fewer records than the
vestigand file, since many vestigands were not found in the dictionary; but
each existing record includes more information. This file is then sorted back
into text order.
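
The sort, one-pass search, and re-sort sequence just described is a
sort-merge lookup against a sequential file. A minimal Python sketch
follows, assuming the dictionary is already in key order and has one entry
per key; the function name and tuple layouts are hypothetical and do not
reflect the actual SELECTV record formats.

```python
def select_entries(candidates, dictionary):
    """candidates: list of (text_position, key).
    dictionary: list of (key, info), sorted by key, read once front to back."""
    by_key = sorted(candidates, key=lambda c: c[1])   # into dictionary order
    found, d = [], 0
    for pos, key in by_key:
        # Advance the sequential dictionary; it is never rewound.
        while d < len(dictionary) and dictionary[d][0] < key:
            d += 1
        if d < len(dictionary) and dictionary[d][0] == key:
            found.append((pos, key, dictionary[d][1]))
    return sorted(found)                              # back into text order
```

The point of the two sorts is that the dictionary tape is traversed only
once, whatever the order in which the candidates occur in the text.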

4.5 Sentence Dictionary File

Since many of the postulated vestigands have been rejected during
dictionary lookup in SELECTV, it now becomes necessary to reconstitute each
segment in terms of valid lexical items. This occurs in the sub-module
WINNOW (second half of LOOKUP). During winnowing, the shortest complete
paths are found which connect the beginning and end of each segment; for each
rejected vestigand, every path which originally included it must be
discarded, and this in turn may eliminate some occurrences of otherwise
acceptable lexical items, since they no longer "occur" in the segment. The
sub-module WINNOW thus outputs the sentence dictionary file: each record
contains all the information of the selected dictionary file, in nearly
identical format, but there are fewer records. Also, the grammatical code for
each lexical item is additionally expressed in terms of its category symbol,
its internal code which will key it to an internal hash table during parsing.
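
One way to picture winnowing is as a shortest-path problem over telecode
positions. The Python sketch below keeps only those lexical items that lie
on some complete path with the fewest items. It rests on assumptions not
stated in the report: that "shortest" means fewest lexical items, and that
items are represented as (start, end) spans with the end exclusive. It is
not a description of the WINNOW code itself.

```python
INF = float("inf")

def winnow(n, items):
    """n: segment length in telecodes; items: set of (start, end) spans found
    in the dictionary, end exclusive. Returns the spans that lie on at least
    one shortest complete path from position 0 to position n."""
    fwd = [INF] * (n + 1)   # fewest items needed to cover positions 0..i
    bwd = [INF] * (n + 1)   # fewest items needed to cover positions i..n
    fwd[0] = 0
    bwd[n] = 0
    for s, e in sorted(items):                              # increasing start
        fwd[e] = min(fwd[e], fwd[s] + 1)
    for s, e in sorted(items, key=lambda span: -span[1]):   # decreasing end
        bwd[s] = min(bwd[s], bwd[e] + 1)
    best = fwd[n]
    if best == INF:
        return []           # no complete path survives the rejections
    return sorted(span for span in items
                  if fwd[span[0]] + 1 + bwd[span[1]] == best)
```

An item that was found in the dictionary can still be dropped here, which
matches the remark that otherwise acceptable items may no longer "occur" in
the segment.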

4.6 Format

The exact record format for each of the interface files is contained
and documented in the interface file definitions; see Section 5.1 and
Supplement 12.
4.7 Random Dictionary Module

It should be noted that there would no longer be any interface files if
(as originally designed) the random dictionary module, with its associated
random-access dictionary, were to replace the modules INVEST and LOOKUP.
This is because batch processing would no longer be necessary: after the raw
text had been pre-edited by CANONZ, each vestigand could be extracted and
made subject to immediate dictionary lookup, and each segment parsed as soon
as it was formed. This is illustrated in Figure 4. It should be remembered,
however, that the random dictionary module could not be implemented,
so that as presently documented its input and output files are not yet
defined in the code.
Figure 4. Random Dictionary Module. With RANDIC
only the raw Chinese text would be processed in
batch mode (CANONZ); all subsequent processing
would be segment-by-segment.
5. Common Blocks

The six main Quince modules plus the two supplemental modules together
use 23 common blocks, i.e. shared storage areas. These blocks are functionally
of two kinds: interface file definitions and module storage tables.

5.1 Interface File Definitions

These six areas are used for the five interface files plus the external-
format dictionary. Each area includes buffer space for one file record, plus
constants defining each record and variable names referencing each field in
the record. The areas do not include any "working space" for processing the
files: they simply "define" the file, and hence these interface file
definitions are used by both the outputting and inputting module on either
side of the interface file.

Because the interface files are sequential character files, there are
no fields of less than one character in length, and no hash tables or other
special allocation involved; thus the interface file definitions contain
no tables accessed through field functions.
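
The idea of a shared record definition can be sketched as follows. This Python fragment is an illustration only: the field names, positions and record length are invented for the example and are not taken from the definitions in Supplement 12. It shows one buffer-sized record, constants describing its fields, and the same definition serving both the writing and the reading side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    start: int   # first character position in the record
    width: int   # width in whole characters (no sub-character fields)


# Hypothetical record layout; the real layouts are in the CBA definitions.
RECORD_LENGTH = 20
FIELDS = (
    Field("SEGNO", 0, 4),
    Field("TELECODE", 4, 4),
    Field("GLOSS", 8, 12),
)


def pack(values):
    """Used by the module that writes the interface file."""
    buffer = [" "] * RECORD_LENGTH
    for f in FIELDS:
        text = str(values[f.name])[: f.width].ljust(f.width)
        buffer[f.start : f.start + f.width] = text
    return "".join(buffer)


def unpack(record):
    """Used by the module that reads the same interface file."""
    return {f.name: record[f.start : f.start + f.width].rstrip() for f in FIELDS}
```

Because every field is a whole number of characters at a fixed position, plain slicing is enough; nothing comparable to a field function is required.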

5.2 Module Storage Tables

The remaining 17 common blocks are used primarily for shared constants
and working space within the various subprograms that make up each of the 8
Quince modules, although a few blocks are shared by several main modules.

This is illustrated in Table 1.

Two of these common blocks (LOCORE and NMCORE) are always resident.

They are part of the debugging and optimizing capability of the Quince
Program-Writing System, and are not considered part of the Quince translation
system in the following discussion.

Most of the remaining 15 common blocks include tables with special
storage requirements: fields of less than 1 character (bit-level), hash
tables with numeric or character keys, floating tables in ECS, dynamically
allocated tables, etc. These are the tables accessed through the so-called
field functions. The 28 field function tables are located in the 11 module
storage tables as shown in Table 2.

5.3 Table Names


In the Quince system source listings, and in the detailed program
documentation which supplements this report, these various tables have
different names in different contexts. This is a consequence partly of the
Quince Program Writing System (outlined in Figure 5), and partly of attempts
to write system-independent code. These include, e.g., separate names for
Fortran arrays and Common Blocks. These names are related to each other in a
reasonably systematic way, as detailed in Table 3. In the present chapter we
will always use the "name-1" in Table 3 (the input to the CBA Common Block
Allocator and the input to the FFN Field Function Writer) when referring to
the interface file definitions, module storage tables, and tables accessed
through field functions.

5.4 Format, Block Data

Both  the  interface  file  definitions  and  module  storage  tables  are 
documented  in  the  Common  Block  Allocator  definitions  in  Supplement  12.  The 
constants  from  each  table  are  extracted  to  build  Block  Data  subprograms,  for 
load-time  initializations;  these  are  documented  in  Supplement  11.  Table  4 
lists  the  Block  Data  subprograms  and  their  corresponding  common  blocks. 

5.5 Field Function Tables

The 28 field function tables are organized so that any field in any
table, regardless of internal format, can be accessed by Fortran in a
transparent and straightforward way. The variable names and constants
associated with each table are documented in Supplement 13. The fields (their
names and sizes) are illustrated in Figures 6A-6O; these are grouped according
to which module storage table they are located in.

These field function tables appear in the code as follows. Consider
the fields IVPSG and IVNSG in field function table IVSFFN, as illustrated in
Figure 6A. These are both pointers, one to the previous segment and one to
the next segment. The following code would reverse two segments by
interchanging the pointers:

    DUMMY = IVPSG(I)
    IVPSG(I) = IVNSG(I)
    IVNSG(I) = DUMMY

This example is written in GASP; here is the FORTRAN code generated by the
GASP Program Writer:

    DUMMY = IVPSG(I)
    CALL IVPSG0 (NULL, I, IVNSG(I))
    CALL IVNSG0 (NULL, I, DUMMY)

The subroutine names IVPSG0 and IVNSG0 have been generated from the function
names IVPSG and IVNSG to form a set/retrieve pair of field functions. In
Supplements 1-9, all field functions are identified as "Calls Made to Routines
Outside Module".
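
For readers unfamiliar with the technique, the set/retrieve pairing can be sketched in Python. This is a sketch under stated assumptions, not the generated Quince code: the 12-bit field widths, the bit positions and the one-word-per-entry layout are invented for illustration (the actual layouts are those of Figure 6A and Supplement 13). It shows how a retrieve function and a set function hide the packing of two pointer fields inside one machine word.

```python
class FieldFunction:
    """A retrieve/set pair for one bit field packed inside an array of words."""

    def __init__(self, words, shift, width):
        self.words = words               # shared table storage
        self.shift = shift               # bit position of the field (assumed)
        self.mask = (1 << width) - 1     # field width in bits (assumed)

    def get(self, i):
        """Retrieve the field from entry i (compare IVPSG(I))."""
        return (self.words[i] >> self.shift) & self.mask

    def set(self, i, value):
        """Store the field into entry i (compare CALL IVPSG0)."""
        if not 0 <= value <= self.mask:
            raise ValueError("value does not fit in the field")
        cleared = self.words[i] & ~(self.mask << self.shift)
        self.words[i] = cleared | (value << self.shift)


table = [0] * 4                                    # hypothetical segment table
ivpsg = FieldFunction(table, shift=0, width=12)    # previous-segment pointer
ivnsg = FieldFunction(table, shift=12, width=12)   # next-segment pointer


def swap_links(i):
    """Interchange the two pointers of entry i, as in the GASP example."""
    dummy = ivpsg.get(i)
    ivpsg.set(i, ivnsg.get(i))
    ivnsg.set(i, dummy)
```

The calling code never sees the shifts and masks, which is the transparency the field functions are meant to provide.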


Within a single module storage area there may be several field function
tables. These are often linked with each other by pointers and pointers-to-
pointers, and processing consists primarily of moving these pointers around.

It is generally the case, however, that each pointer always links with a
certain type of node, i.e. with a certain other field function table, although
the particular node in question may change within the table.

Figure 7 presents some of the more complex data structures used in the
Quince system. The field names and field function table names are as in
Figures 6A-6O.

At almost every stage of the translation process, the text must be
considered to have the data structure of a lattice, rather than a string.

This is due to the linguistic nature of Chinese: there are so many
ambiguities in its analysis. Telecode substitution introduces alternate
readings for each telecode, vestigands introduce alternate combinations of
telecodes, and multiple possible category symbols for each lexical item
introduce alternate parse trees. For this reason, at almost every stage of
translation, and within each module, data structures such as those in
Figure 7 are used.
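
A lattice in this sense can be pictured as a set of labelled spans over text positions, where several spans may cover the same stretch of text. The Python sketch below is illustrative only and does not reproduce the pointer-linked tables of Figure 7; the edge labels are invented. It enumerates every reading of a small lattice as a path from the first position to the last, assuming every span runs forward (start position less than end position).

```python
from collections import defaultdict


def readings(edges, start, end):
    """Return every path through the lattice as a list of edge labels."""
    outgoing = defaultdict(list)
    for begin, finish, label in edges:
        outgoing[begin].append((finish, label))

    def walk(position):
        if position == end:
            yield []
            return
        for nxt, label in outgoing[position]:
            for rest in walk(nxt):
                yield [label] + rest

    return list(walk(start))


# Hypothetical lattice: positions 0..3; "AB" is a two-telecode combination
# competing with the single items "A" and "B"; the last item has two readings.
EDGES = [
    (0, 1, "A"),
    (1, 2, "B"),
    (0, 2, "AB"),
    (2, 3, "C1"),
    (2, 3, "C2"),
]
```

Even this small example has four readings, which suggests why a plain string is not an adequate representation.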

Their manipulation takes up much of the program logic within the Quince
modules; accordingly, the modules themselves will not be further documented
in this chapter. The reader is referred to Supplements 1-9 for further
details on the modules themselves.

6.  Utilities 

The Quince utility modules include those subprograms which are machine-
or system-dependent, but which (generally) do not manipulate fields-within-
a-word (this capability is provided by the field functions).

The  utility  source  listings  are  presented  in  Supplement  10;  they  do 
not,  however,  contain  as  much  internal  documentation  as  do  the  other  Quince 
modules,  and  so  they  are  summarized  in  this  section. 

The utilities provide support in three general areas: input/output,
format conversion, and debugging and optimization. The I/O routines handle
files on random-access storage, Extended Core Storage, and system files
(coded and binary), as well as reading the system registers. They include
the READS and WRITES routines, which replace the FORTRAN read and write
statements for coded serial files.

The format conversion routines convert and shift among binary, decimal,
integer, display character, and FORTRAN A1 format. They provide
justification, and handle the only fixed-format field in the Quince system:
packed machine-dependent telecodes.

The debugging and optimization routines are the most conspicuous in
the Quince module source listings, as they appear in every routine to permit
timing, tracebacks, and counts of entry-within-each-routine; they are
specially implemented so as to catch fatal FORTRAN errors before they produce
a system crash, so that traceback can be completed. It is these routines
that use the resident Common Blocks LOCORE and NMCORE. In a full production
version of the Quince system, many of these modules would be removed
completely; at present, they are controlled by switches on the system
registers. Table 5 lists the utility routines by their function.
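
The switch-controlled instrumentation described above has a rough modern analogue, sketched here in Python. Nothing in this fragment is taken from the Quince utilities: the switch names, the counters and the instrumented routine are all hypothetical, and a dictionary stands in for the system registers.

```python
import functools
import time

# Stand-in for the switches on the system registers (names are assumptions).
SWITCHES = {"count": True, "timing": True}
ENTRY_COUNTS = {}
ELAPSED = {}


def instrumented(func):
    """Wrap a routine so entries are counted and timed when switched on."""
    name = func.__name__

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if SWITCHES["count"]:
            ENTRY_COUNTS[name] = ENTRY_COUNTS.get(name, 0) + 1
        if not SWITCHES["timing"]:
            return func(*args, **kwargs)
        started = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Recorded even if the routine fails, so a trace stays complete.
            spent = time.perf_counter() - started
            ELAPSED[name] = ELAPSED.get(name, 0.0) + spent

    return wrapper


@instrumented
def lookup(code):
    """Hypothetical routine standing in for any Quince subprogram."""
    return code * 2
```

Turning a switch off leaves the routine's behaviour unchanged while skipping the bookkeeping, which mirrors the option of removing the routines in a production version.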

7. Additional Documentation

The  Quince  system  has  also  been  documented  in  previous  final  reports. 
These  tend  toward  providing  a description  of  the  processing  in  linguistic, 
rather  than  in  computer,  terms;  however,  much  of  the  information  is  still 
current. In particular, (3) presents the coding conventions for identifying
the special telecodes, and an outline of the procedures used in text
preparation. (4) outlines the "steps" of machine translation.

The  Quince  Program  Writer  has  a full  description  in  (4),  which  also 
describes  the  plotting  capabilities  available  through  the  SAS  Compatibility 
module. 

There  are  also  several  unpublished  papers  available  from  the  Project. 
(1)  is  a manual  for  writing  GASP,  the  structured  programming  language  used 
as  the  source  language  for  all  the  Quince  modules;  this  source  code  is 
translated  into  Fortran  by  the  GASP  translator.  (2)  is  a description  of  the 
theoretical  approach  used  in  the  PARSER  module. 

Figure 6A. Fields used in the Field Function Tables of
Module Storage Table IVCORE

Figure 6B. Fields used in the Field Function Tables of
Module Storage Table SBCORE

Figure 6D. Fields used in the Field Function Tables of

Figure 6H. Fields used in the Field Function Tables of
Module Storage Table PRCORE (cont.)

Figure 6I. Fields used in the Field Function Tables of
Module Storage Table ARCORE

Figure 6J. Fields used in the Field Function Tables of
Module Storage Table ARCORE (cont.)

Figure 6K. Fields used in the Field Function Tables of
Module Storage Table DTCORE

Figure 6O. Fields used in the Field Function Tables of
Module Storage Table STCORE

Table 1. Module Storage Tables Resident
for Each of the Quince Modules


Module          Storage Table   Field Function Tables

INVEST          IVCORE          IVSFFN - invest segment table
                                IVTFFN - invest text table
                SBCORE          SBHFFN - telecode substitution hash table
                                SBSFFN - telecode substitution table
                                SBTFFN - telecode list table for substitutions

LOOKUP          LUCORE          LPTFFN - lookup spans table
                                LQHFFN - lookup span queue table
                                LQTFFN - lookup queue tree table
                                LSPFFN - lookup span table
                SLCORE
                CSCORE          CSIFFN - category symbol data table
                                CTHFFN - category symbol hash table

ADAPTG          SRCORE

ADAPTG/PARSER   AGCORE          ARLFFN - adapted rules table
                                QTRFFN - winnowing queue tree table

PARSER          PRCORE          CLTFFN - lattice parse-time aux. table
                                CRCFFN - constitute parse-time auxiliary table
                                CSUFFN - category symbol parse-time data table
                                SPQFFN - sentence position queue heads table
                ARCORE          ALTFFN - archive lattice table
                                ARCFFN - archive constitute table
                                LTPFFN - lattice list next list pointers

STREXT          STCORE          STRFFN - master tree table

RANDIC          DTCORE          CDWFFN - word nodes for dictionary page
                SGDATA          SGATFFN - pointer to active telecodes
                                SGSDFFN - telecode span list
                                SGTTFFN - list of active telecodes
                SGTABS          SGSNFFN - substitutes/alternatives for telecodes
                                SGSPFFN - telecodes for substitution/alternation
                                SGSTFFN - segmentation breaks table

ALCHEM          STRSAS

Table 2. Field Function Tables Located in Each
of the Module Storage Tables

Module Storage Table            Field Function Table

name-1    name-2    name-3      name-1    name-2    name-3

CNCORE    CNCOR     CNBLK

IVCORE    IVCOR     IVBLK       IVSFFN    IVSFF     IVSGTB
                                IVTFFN    IVTFF     IVTXTB

SBCORE    SBTCO     SBTBLK      SBHFFN    SBHSH     SBHTAB
                                SBSFFN    SUBST     SBSTAB
                                SBTFFN    SBTLC     SBTTAB

LUCORE    LUCOR     LUBLK       LPTFFN    LPTFF     LPTXTB
                                LQHFFN    LQHFF     LQHDTB
                                LQTFFN    LQTFF     LQTRTB
                                LSPFFN    LSPFF     LSPNTB

SLCORE    SLCOR     SLBLK

AGCORE    AGCOR     AGBLK       ARLFFN    ARLFF     ARLTAB
                                QTRFFN    QTRFF     QTRTAB

CSCORE    CSCOR     CSBLK       CSIFFN    CSEQU     SCIBAS
                                CTHFFN    CSEQII    CTHTAB

SRCORE    SRCOR     SRBLK

PRCORE    PRCOR     PRBLK       CLTFFN    CLTFF     CLTCTB
                                CRCFFN    CRCFF     CRCNST
                                CSUFFN    CSUFF     CSUSTB
                                SPQFFN    SPQFF     SPQHDS

ARCORE    ARCOR     ARBLK       ALTFFN    ALTFF     ALTCTB
                                ARCFFN    ARCFF     ARCNST
                                LTPFFN    LTNFF     LTCNXT

DTCORE    DCTGM     DCTOR       DCWFFN    DWDEQ     DCTWDS

SGDATA    SGDCM     SGDCOR      SGATFFN   ACTEQ     ACTNDS
                                SGSDFFN   SPNEQ     SPNNDS
                                SGTTFFN   TXTEQ     TXTNDS

SGTABS    SGTCM     SGTCOR      SGSNFFN   SBPEQ     SBPTRS
                                SGSPFFN   SBPEQ     SBPTRS
                                SGSTFFN   BKSEQ     SEGBKS

STCORE    STCOR     STRESD      STRFFN    STABL     STREE

STRSAS    STRSA     STRTEM

LOCORE    LOCOR     RESDNT

NMCORE    NMCOR     NMBLK


Interface File Definitions

name-1    name-2    name-3

CZTDEF    CZTCR     CZTBLK
SGTDEF    SGTCR     SGTBLK
VSTDEF    VSTCR     VSTBLK
PDCDEF    PDCCR     PDCBLK
SLDDEF    SLDCR     SLDBLK
SDCDEF    SDCCR     SDCBLK

Table 3. Names of the Module Storage Tables, Field Function
Tables, and Interface File Definitions

Common Block    Block Data Subprogram

Interface File Definitions

CZTDEF          BDCZT1
PDCDEF          BDPDC1
SDCDEF          BDSDC1
SGTDEF          BDSGT1
SLDDEF          BDSLD1
VSTDEF          BDVST1

Module Storage Tables

AGCORE          BDAGC1
ARCORE          BDARC1
CNCORE          CDCNZ1
CSCORE          BDCS1
IVCORE          BDIVC1
LUCORE          BDLUC1
PRCORE          BDPRS1
SBCORE          BDSET1
SLCORE          BDSLC1
SRCORE          BDSRL1
STCORE          BDSTR1
STRSAS          BDSTS1

Resident Tables

LOCORE          BDLOC1
NMCORE          BDNAME

Table 4. Block Data Subprograms
for Each of the Common Blocks

I. I/O Routines

   A. random-access disk I/O (COMPASS)
      CLDISC, LRDISC, MTDISC, NMDISC, OPDISC, RDDISC, WRDISC

   B. ECS storage (COMPASS)
      RE, WE

   C. system registers (COMPASS)
      READRG, WRITRG

   D. system files
      1) binary files
         OPBIN, CLBIN, REWINB, WREOFB, RDBIN, WRBIN
      2) coded files
         OPCOD, CLCOD, REWINC, WREOFC, RDCOD1, WRCOD1
      3) serial coded files
         READS, WRITES

II. Format Conversion Routines (COMPASS)

   A. packed machine-dependent telecodes
      ARTOTC, TCTOAR, TCCOMP

   B. binary/decimal/integer/display/A1
      BTOD2A, BT0024, ITOC, CTOI, ARFORM, RAFORM

   C. non-FORTRAN character
      NFORTC

   D. justification
      LJUSTC, RJUSTC, LJUSTI, RJUSTI

III. Other Routines

   A. collating sequences
      CllCODE, CODECIl

   B. shift
      LLS, LRS

   C. field insertion
      INBUF

   D. string equality
      STREQ

IV. Debugging and Optimisation Routines

   A. Traceback
      TRCBAK, DBOUT, ERROR, PRTBE, PRTBP, PRTBS, NARCS, RECOVR

   B. Timing
      TIMER, PRC RAF

   C. Machine environment
      CONFIG

Table 5. Utility Routines Classified
by Function

APPENDIX: HARDWARE, SOFTWARE AND DATA INVENTORY


1. Hardware Inventory

1.1 Teletype KSR-37 Terminal, with upper and lower case, 150 baud. At
present it can be connected via modem with the CDC 6400 at the Lawrence
Berkeley Radiation Laboratory, and also into the ARPANET.

1.2 Chinese Teleprinter Model 600D, 2 sets. Each set has a configuration
consisting of a Chinese character keyboard, a printing unit for direct
hard-copy output, a paper tape punch and a reader, and a slightly
modified standard teletype with a standard English keyboard.

1.3 DEC DL-11E Asynchronous Serial Interface, for character display on
DEC PDP 11/20 - VT-11 display.

2. Software Inventory

2.1 Quince System

The Quince system modules are contained in three libraries, each of
which is stored on tape and maintained in both source and object form
in a set of three cycles each.

1.  DMLIB,  the  Data  Management  Library 

a.  field  function  definitions 

b.  field  functions 

c.  common  block  allocator  definitions 

2. UTLIB, the Utilities Library: all system- or machine-dependent
routines, both in assembler (COMPASS) and Fortran

3.  QULIB,  the  Quince  Library 

a.  GASP  source  of  all  system-independent  subprograms 

b.  Block  Data  subprograms 

2.2  Program  Writing  System 

The Program-Writing System is a body of locally-written software
aids for creating and maintaining large bodies of code. Each is stored
on its own tape.

1. GASP Fortran Translator

2. Field Function Writer

3.  Common  Block  Allocator 

2.3 Other Software


1.  SAS  --  Syntactic  Analysis  System  — predecessor  to  Quince  system 

2.  Plot  Routines 

The  plot  routines  permit  the  graphic  display  of  trees  and  Chinese 
characters  for  research  purposes;  they  are  actually  part  of  the 
previous  Syntactic  Analysis  System,  with  data  interface  via  the 
SAS  Compatibility  Module  (ALCHEM)  of  the  Quince  system. 


a.  plotting  subprograms 

b.  vector  definitions  of  7000  Chinese  characters 

3.  Data  Inventory 

3.1 Chinese-English Dictionaries on Tape

1.  CHIDIC  (approximately  80,000  records) 

2.  PHYDIC  (approximately  40,000  records) 

3.  McGraw-Hill  Scientific  Dictionary  (partial)  on  5 reels 

4.  DOD  Chinese-English  Scientific  Dictionary  (approximately  500,000 
records)  on  4 reels 

5.  Special  sorts  on  CHIDIC 

(i)  one-telecode  entries 

(ii)  long  entries  (more  than  3 telecodes) 

(iii)  reverse  telecode  sort 

(iv)  grammar  code  sort 

6.  Special  sort  on  PHYDIC 
(i)  grammar  code  sort 

3.2 Chinese Grammars

The Chinese grammar consists of five levels. The total number of rules
at each level is:

Grammar 1     124 rules
Grammar 2     506 rules
Grammar 3    2408 rules
Grammar 4     336 rules
Grammar 5    2744 rules

The rules are available in three arrangements:

1. five-level grammar by levels

2. five-level grammar by length

3. concordance of rules

3.3 Chinese Texts


1. Physics Texts (papers numbered 1 to 37).

Total telecodes: 421,464

The Physics Texts were coded from the following books and articles:

a. Yuanzineng jichu zhishi. Huadong Shifan Daxue. 1958. 114pp.

b. Yuanzineng de yuanli he yingyong. Kexue Chubanshe. 1965.

c. Yuanziheneng. Kexuejishu Chubanshe. 1957. 262pp.

d. Yuanzineng de jiben lilun yu yuanzineng de heping gongxian.
(no publisher). 1966. 44pp.

e. Ci liuti lixue

f. Gaowen dengliziti donglixue

g. Gaowen dengliziti de fushe

h. Gaowen dengliziti zhenduan fangfa

i. Tongweisu he shexian de yingyong (xia)

j. Jige xinde ji jingguo gaizhuang de yanjiuxing rezhongzi fangyingdui

k. P($T fanyingdui zhongzi tongliang de zengjia ji shiyan kenengxing
de kuoda

l. Shiyanxing qingshui nongsuoyou fanyingdui (BBP-2) de gaizhuang

m. Zhongshui fanyingdui (TP) de gaizhuang

n. Gonglü wei 2000 wa de chenruxing shiyan fanyingdui (HPT)

o. Rezhongzi tongliang 10^14 zhongzi/limi^2 · miao de yanjiu fanyingdui
(BBP-M)

p. Fushe huaxue yanjiu zhuanyong fanyingdui (BBP-U)

q. Yuanzi he yuanzineng. Jiaoyu Tupian Chubanshe. 1956. 121pp.

2. Biochemistry Texts (papers numbered 1 to 17).

Total telecodes: 59,320

3. "Tokuyama" Texts (sample excerpts from modern Chinese short
stories ca. 1920-30, obtained on a cooperative project with Dr.
Helen Tokuyama of the University of California at Irvine).

Total telecodes: 83,830

Total Machine-Readable Text Telecoded: 564,614

The Physics Texts c, e through q and the Tokuyama Texts exist both on
magnetic tape and on the original Chinese Teleprinter paper tape. The
rest of the Physics Texts and the Biochemistry Texts exist on 80-column
cards.

3.4  Chinese  Character  I/O  Information 

1.  Kuno  character  vectors  (7,000  records) 

2.  Telecode-Romanization  Table  (10,000  cards) 

3.  Chinese  Teleprinter  Keyboard  to  Telecode  Table  (4,800  cards) 

4.  Four-Corner  System  Romanization  Equivalences  (1,500  cards) 

5. Original Cards for Chinese Character Indexes Volumes (14,000 cards)

6.  Augmentation  to  Chinese  Character  Indexes,  with  romanization 
equivalents  (14,000  cards) 



Supplements


 1. Text Canonization Module              (CANONZ)
 2. Vestigand Formation Module            (INVEST)
 3. Dictionary Lookup Module              (LOOKUP)
 4. Probative Parser Module               (PARSER)
 5. Parse Table Print Module              (PARENT)
 6. String Extraction Module              (STREXT)
 7. Grammar Adaptation Module             (ADAPTG)
 8. Random Dictionary Module              (RANDIC)
 9. SAS Compatibility Module              (ALCHEM)
10. Quince Utilities Module               (QUTILS)
11. Loader Storage Allocation Module      (BLKDAT)
12. Common Block Definitions              (CBADEF)
13. Field Function Definitions            (FFNDEF)
14. GASP System Language Pre-processor    (GSPSRC)

III. THE STATE OF THE ART IN COMPUTATIONAL SYNTACTIC
DESCRIPTION OF NATURAL LANGUAGES


1. Introduction

When the recent history of linguistics is viewed from the perspective
of computational linguistics and machine translation, it may fairly be said
that the most conspicuous event remains the introduction of the context-free
phrase-structure grammar by Noam Chomsky in the middle 1950's. Despite the
variety of alternative formalisms for the description of languages which have
been introduced (by Chomsky and others) in the intervening twenty years, it
is still the context-free grammar which dominates the thinking of
computational linguists and dominates, also, the systems which they devise.

There are, however, certain technical difficulties with the use of
context-free grammars which have led computational linguists to "augment"
their grammars with "features", and with "conditions" or "actions" based on
the features. This is true of the well-known systems of today (e.g., Woods'
ATN grammars and Winograd's systemic grammars are of this kind), and has
been true stretching back to the days of the COMIT programming system of
Yngve. Most computational linguists believe that such augmented context-free
grammars are sufficient to describe natural languages, at least in some rough
practical way, although there is no real theory explaining how to use the
augmentations, or why they are so helpful.

Such augmentation devices were also used (though much less formally
and systematically, of course) by linguists of the pre-Chomskian American
structuralist tradition to describe such phenomena as agreement and
contextual restrictions; this is one aspect of their procedures which was
never reconstructed satisfactorily in phrase-structure grammars. Furthermore,
the lack of such augmentation devices has proved troublesome in current
linguistic uses of context-free grammars, and they have now been re-introduced
in the most recent work on the base component of Chomskian transformational
grammars: first with features of lexical items, and then with complex-
symbol representations for all nonterminals of the grammar.

In  an  entirely  unrelated  development,  as  a way  of  defining  programming 
languages,  the  properties  of  the  syntactic  description  method  known  as  "van 
Wijngaarden  grammars"  or  "the  Algol  68  definition  method"  have  recently 
become  better  understood.  It  now  appears  that  this  method  of  describing 
"context-free  grammars  with  structured  vocabulary"  does  reconstruct  an  impor- 
tant element  common  to  structuralist  linguists,  recent  Chomskian  linguists, 
and  computational  "augmentations":  that  is,  the  use  of  significant  abbrevia- 
tory  conventions  in  context-free  grammars  by  exploiting  a systematically- 
structured  vocabulary  of  symbols. 

Not  only  does  the  van  Wijngaarden  syntactic  description  method  appear 
to  neatly  cover  a wide  variety  of  extensions  to  context-free  grammars  and 
thus  give  insight  into  what  important  properties  they  share,  but  related 
formalisms (Koster's affix-grammars, Knuth's attribute-grammars) offer
similar  properties  while  also  being  naturally  related  to  context-free 
grammars  in  the  sense  that  the  naturalness  of  interpretation  and  attractive 
parsing  properties  of  context-free  grammars  are  preserved.  Hence,  although 
these  formalisms  have  all  been  developed  in  connection  with  programming 
languages,  they  appear  to  be  of  even  greater  interest  and  importance  for  the 
processing  of  natural  languages. 

In  this  chapter  we  will  review  the  state  of  the  art  in  defining 
grammars  for  natural  languages  which  are  suitable  for  computational  use  in 
machine  translation  systems,  giving  particular  stress  to  the  new  methods 
just  mentioned  as  models  for  a good  deal  of  current  unformalized  practical 
knowledge.  We  will  attempt  to  provide  an  overview  of  the  progress  in  defining 
programming  languages,  and  to  relate  the  new  features  of  this  work  to  prior 
descriptive  methods  used  by  linguists.  Finally,  we  indicate  how  this  work  is 
related  to  the  parsing  implemented  in  the  Quince  system  at  the  Project  on 
Linguistic  Analysis,  and  what  further  research  is  needed  over  the  next  three 
to  five  years  to  incorporate  these  improved  techniques. 

2. Context-Free Grammars


We  began  with  the  observation  that  all  computational  systems  currently 
used  for  research  on  natural  language  are  based  on  context-free  grammars, 
even  if  they  also  incorporate  much  additional  machinery  to  interpret,  or 
translate,  or  whatever.  In  this  section  we  will  review  the  advantages  of 
the  context-free  grammar  formalism  which  have  led  to  this  state  of  affairs, 
the  extensions  to  the  basic  context-free  grammar  which  are  introduced  to 
counter  certain  disadvantages,  and  the  remaining  difficulties. 

2.1 Advantages of Context-Free Grammars

The  reasons  for  the  pre-eminent  popularity  of  the  context-free  grammar 
formalism  are  many.  First,  perhaps,  is  the  fact  that  a context-free  grammar 
both  defines  a set  of  admissible  strings,  by  giving  a set  of  constraints  on 
the  ordering  of  elements,  and  also  associates  with  each  string  in  its  language 
a hierarchical,  tree-like  structural  description.  It  turns  out  that  almost 
always  one  wishes  both  to  separate  valid  strings  from  invalid,  and  also  to 
assign  structures  to  the  valid  ones;  perhaps  it  is  only  so  because  the  tool 
is  at  hand,  but  this  has  seemed  a logical  single  task. 
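
This double role of a context-free grammar, accepting or rejecting a string and assigning a tree to each accepted string, can be made concrete with a small sketch. The Python fragment below is not drawn from the Quince parser: it uses the standard CYK tabular method on a toy grammar in binary form, the grammar and vocabulary are invented, and for brevity it keeps only one tree per span rather than all ambiguous analyses.

```python
def cyk_parse(words, lexical, binary, start="S"):
    """Return one parse tree for `words`, or None if the string is invalid.

    lexical: dict mapping a word to its possible categories
    binary:  list of rules (parent, left, right)
    """
    n = len(words)
    chart = {}   # (begin, end, category) -> tree

    for i, word in enumerate(words):
        for category in lexical.get(word, ()):
            chart[(i, i + 1, category)] = (category, word)

    for span in range(2, n + 1):
        for begin in range(0, n - span + 1):
            end = begin + span
            for middle in range(begin + 1, end):
                for parent, left, right in binary:
                    left_tree = chart.get((begin, middle, left))
                    right_tree = chart.get((middle, end, right))
                    if left_tree is not None and right_tree is not None:
                        # Keep the first tree found for this span and category.
                        chart.setdefault(
                            (begin, end, parent), (parent, left_tree, right_tree)
                        )

    return chart.get((0, n, start))


# Toy grammar (an assumption for illustration only).
LEXICAL = {"the": ["Det"], "dog": ["N"], "barks": ["VP"]}
BINARY = [("S", "NP", "VP"), ("NP", "Det", "N")]
```

A single declarative rule set thus serves both purposes: `cyk_parse` returns a nested tree for "the dog barks" and `None` for the scrambled string "dog the barks".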

It is also true that in an amazingly wide range of applications the
context-free grammar has seemed to be "a natural conceptual basis for
definitions; the basis must correspond to the way we actually think about
(what is being defined), otherwise the related formalisms are not likely to be
fruitful" (Knuth 1971). In large part, context-free grammars have been such
a "fruitful formalism" because of the declarative character of a grammar.
Donald Knuth, again, says that "a grammar is 'declarative' rather than
'imperative'; it expresses the essential relationships between things without
implying that these relationships have been deduced using any particular
algorithm" (Knuth 1971). This notion of a grammar as a set of declarative
"well-formedness conditions" is also familiar to linguists, from McCawley's
discussion of the phrase-structure base component of a transformational
grammar (McCawley 1968). (A frequent shortcoming of computational research
on natural languages, especially that conducted by non-linguists under the
name of "artificial intelligence", has been to extend context-free grammars
in procedural ways, apparently out of a lack of appreciation for declarative
formalisms; see, e.g., Winograd 1971, 1975.)

Finally,  not  only  is  it  true  that  relatively  small  context-free 
grammars  are  easy  for  human  beings  to  devise,  understand,  and  improve,  but 
they  are  also  easy  for  computers  to  manipulate.  There  have  always  existed 
algorithms  for  parsing  with  a context-free  grammar,  and  in  recent  years 
extremely  good  algorithms  have  been  described  and  refined  in  many  variants 
appropriate  for  a wide  range  of  purposes  (Aho  and  Ullman  1973). 

2.2 Disadvantages of Context-Free Grammars

There  are,  to  be  sure,  some  disadvantages  of  context-free  grammars,  and 
they  spring  to  mind  even  more  readily  than  do  the  advantages  since  they  are 
a constant  source  of  difficulty. 

The theoretical difficulties may be dispensed with: such things as
the inability to have infinite branching from a single node (so as not to put
an upper bound on the number of items in a coordinate structure), or the
inability to deal with unbounded overlapping dependencies of the sort which
are well-known to be a prominent feature of Mohawk (Postal 1964b).
(Postal's criticism is sound, although just slightly askew; it is revised in
Fidelholtz 1974.) These difficulties are real, but irrelevant. Such examples
show that neither in terms of weak generative capacity (the sets of strings)
nor in terms of strong generative capacity (the sets of structural
descriptions) do context-free grammars provide a description of natural
language surface structures; but as a practical matter they cause no
particular trouble.

The fact that these are the wrong terms in which to discuss the
adequacy of context-free grammars becomes clear from the observation that, if
we simply restrict a natural language to sentences short enough to fit in
six-point type between California and Alpha Centauri, then the restricted
language will be finite and hence trivially a Chomsky type 3 (finite-state)
language, and thus a fortiori context-free.

The point is that a grammar which is not clear enough to be invented
and improved by human beings cannot be produced; and clarity, as Edsger
Dijkstra observes, "has pronounced quantitative aspects" (Dijkstra 1972).

The  chief  practical  difficulty  with  context-free  grammars  for  natural 
languages has been their large size and their corresponding lack of
transparency. Susumu Kuno (1963) reported on an English grammar containing 133
syntactic  categories  and  over  2100  rules,  which  did  not  yet  incorporate  the 
obvious  agreement  restrictions  of  English.  Grammars  with  upwards  of  10,000 
rules  are  known  to  exist. 

As  the  number  of  rules  grows  into  the  thousands,  and  as  it  is  realised 
that  tens  of  thousands  of  rules  would  be  only  a beginning,  all  the  practical 
advantages  of  context-free  grammars  disappear.  Such  grammars  are  no  longer 
at  all  easy  to  understand,  nor  are  they  easy  to  manipulate  for  computer  use. 

A typical  experience,  repeated  over  and  over  throughout  the  1960's  and 
early  1970's,  has  been  that  a context-free  grammar  can  be  written  readily  to 
serve as an initial demonstration model over a limited range, but that
replacing that context-free grammar with one adequate for actual natural
language  is,  practically  speaking,  impossible.  As  Samuel  Johnson  wrote 
in  the  preface  to  his  Dictionary  of  1755,  "a  large  work  is  difficult  because 
it  is  large,  even  though  all  its  parts  might  singly  be  performed  with 
facility." 


2.3  Why  Context-Free  Grammars  Grow  Large 

Context-free grammars grow large beyond the effective power of humans
to control them primarily because of the need to encode within them
restrictions on contexts. This is, of course, not a contradiction or a paradox;
the name "context-free" refers to the form of the rules in the grammar,
not to any impossibility of utilizing context to restrict the language of
the grammar. As every linguist should know by now, Peters and Ritchie
(1973) contains a demonstration that every language which can be "analyzed"
by testing putative structural descriptions using context-sensitive
well-formedness rules is a context-free language; that is, it is also
generated by a context-free grammar, although a context-free grammar which
may have many, many rules.

A small  example  of  the  way  in  which  the  need  for  context  expands 
context-free  grammars  is  given  by  Winograd  (1971).  He  exhibits  the 
grammar:

    1. S -> NP VP
    2. NP -> DET NOUN
    3. VP -> VERB/INTRANS
    4. VP -> VERB/TRANS NP
    5. DET -> the
    6. NOUN -> giraffe
    7. NOUN -> apple
    8. VERB/INTRANS -> dreams
    9. VERB/TRANS -> eats

which generates derivations such as:

                     S
           __________|__________
          NP                    VP
       ___|___            ______|______
     DET     NOUN    VERB/TRANS        NP
      |       |           |         ___|___
      |       |           |       DET     NOUN
      |       |           |        |       |
     the   giraffe      eats      the    apple

Winograd  points  out,  though,  that  to  expand  the  grammar  so  as  to  include 
number  agreement  for  subjects,  thus  giving: 

The  giraffes  eat  the  apple. 

The giraffe eats the apple.

but not:

*The  giraffes  eats  the  apple. 

*The  giraffe  eat  the  apple. 

requires (if we are to be strictly observant of the notion of a context-free
grammar) that we introduce new category symbols to code the terminal
vocabulary. We must add rules:

    6a. NOUN/SG -> giraffe
    6b. NOUN/PL -> giraffes
    8a. VERB/INTRANS/SG -> dreams
    8b. VERB/INTRANS/PL -> dream
    9a. VERB/TRANS/SG -> eats
    9b. VERB/TRANS/PL -> eat

and  we  must  also  double  the  number  of  rules  above  the  terminals,  adding 
additional  non-terminal  vocabulary  as  necessary: 

    1a. S -> NP/SG VP/SG
    1b. S -> NP/PL VP/PL
    2a. NP/SG -> DET NOUN/SG
    2b. NP/PL -> DET NOUN/PL
    3a. VP/SG -> VERB/INTRANS/SG
    3b. VP/PL -> VERB/INTRANS/PL
    4a. VP/SG -> VERB/TRANS/SG NP
    4b. VP/PL -> VERB/TRANS/PL NP

(Observe  that  two  symbols  in  this  grammar  such  as  NP/SG  and  NP/PL  are  wholly 
distinct  symbols  from  the  standpoint  of  the  definition.  Their  similarity  in 
spelling  is  a help  to  the  human  reader  in  grasping  the  significance  of  the 
symbols in the grammar, but the grammar itself does not exploit the
similarity.) We now have derivations such as:


               S
         ______|______
       NP/SG        VP/SG
      ___|___         |
    DET   NOUN/SG   VERB/INTRANS/SG
     |       |        |
    the   giraffe   dreams

This  is  straightforward,  and  it  is  clear  that  the  way  in  which  we  can 
enforce  number  agreement  in  context  is  by  duplicating  the  vocabulary  of 
the  grammar  and  the  productions  of  the  grammar  all  the  way  back  from  two 
items  which  must  agree  (such  as  NOUN/SG  and  VERB/INTRANS/SG)  to  their 
common parent (here, clear back to the start symbol S). If the agreement
possibilities  have  three  values  (masculine,  feminine,  and  neuter,  say) 
then  the  symbols  and  the  rules  must  be  multiplied  by  three,  and  so  forth. 
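
The multiplication just described can be made concrete with a small sketch. The Python below is illustrative only: the function name `split_symbols` and the feature names are assumptions introduced here, not part of Winograd's example, and the sketch simplifies by giving every category the same feature values. It counts how many distinct plain symbols are needed once each base category is split by every combination of feature values.

```python
from itertools import product

def split_symbols(base_symbols, features):
    """Return every plain context-free symbol needed once each base symbol
    is split by every combination of feature values.

    `features` maps a (hypothetical) feature name to its possible values.
    """
    value_lists = list(features.values())
    return [
        "/".join((base,) + combo)
        for base in base_symbols
        for combo in product(*value_lists)
    ]

base = ["NP", "VP", "NOUN", "VERB/INTRANS", "VERB/TRANS"]

# One two-valued feature doubles the vocabulary ...
number_only = split_symbols(base, {"number": ["SG", "PL"]})
# ... and a second two-valued feature doubles it again.
number_and_animacy = split_symbols(
    base, {"number": ["SG", "PL"], "animacy": ["ANIM", "NONANIM"]}
)

print(len(base), len(number_only), len(number_and_animacy))  # 5 10 20
```

A three-valued feature in place of a two-valued one would triple the count instead of doubling it, since the size of the result is the product of the sizes of the value lists.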

It  becomes  discouraging  to  note  that  a similar  multiplication  will 
be  required  for  every  individual  feature  of  context  which  must  be  coordinated 
— next,  for  instance,  we  might  notice  that  our  grammar  even  with  number 
agreement  for  subjects  will  derive: 


                    S
          __________|__________
        NP/PL                VP/PL
      ____|____         _______|_______
    DET     NOUN/PL   VERB/TRANS/PL    NP
     |         |           |        ___|___
     |         |           |      DET     NOUN
     |         |           |       |       |
    the     apples        eat     the   giraffe


(We can assume that the rule rewriting plain NP still exists in the
grammar, because object number agreement is not necessary; alternatively we
could double the VP rules again, to introduce freely both NP/SG and NP/PL
as objects.) This would lead us to double once again to code the correct
restrictions for "animate subject", leading to terminal vocabulary such as:

    NOUN/SG/ANIM -> giraffe
    NOUN/PL/ANIM -> giraffes
    NOUN/SG/NONANIM -> apple
    NOUN/PL/NONANIM -> apples
    VERB/TRANS/SG/ANIMSUBJ -> eats
    VERB/TRANS/PL/ANIMSUBJ -> eat

and again we must double the rest of the productions, beginning with:

    1a. S -> NP/SG/ANIM VP/SG/ANIMSUBJ
    1b. S -> NP/PL/ANIM VP/PL/ANIMSUBJ
    1c. S -> NP/SG/NONANIM VP/SG/NONANIMSUBJ
    1d. S -> NP/PL/NONANIM VP/PL/NONANIMSUBJ

Even though we have already passed the point of reasonableness, it is
obvious that we must continue this process for a long time. As Winograd
remarks, "this sort of duplication propagates multiplicatively through the
grammar, and arises in all sorts of cases."

This then is the way in which the desire to encode context restrictions
causes a context-free grammar to grow, by multiplying its sequences of
rules over and over again.

2.4 Attempts to Reduce the Size of Context-Free Grammars

The preceding analysis of how context-free grammars become
unmanageable suggests that some method is needed to simplify grammars by
indicating where all the parallel sets of rules occur. In practice,
virtually every serious use of context-free grammars has introduced some
such mechanism (even if only informally), and such mechanisms will be the
subject of following sections.

In  addition  to  these  methods,  however,  there  have  been  at  least  two 
attempts  to  reduce  the  size  of  large  grammars  by  a process  of  factoring 
them  into  smaller  grammars,  a sort  of  "divide  and  rule"  strategy. 

The first of these was Woods's notion of creating a "regular expression
grammar" (Woods 196*1). In principle, this consists of an algorithm which
takes  an  arbitrary  context-free  grammar  and  factors  it  into  a set  of 
regular  (type  3)  grammars,  plus  the  essentially  context-free  transitions 
between  them.  In  practice  it  seems  that  all  examples  have  been  composed 
by  hand  in  already  factored  form.  In  addition  to  the  gain  of  breaking  a 
larger  context-free  grammar  into  a number  of  smaller  grammars,  it  was 
pointed  out  that  the  smaller  grammars  could  be  improved  by  using  optimization 
techniques  applicable  to  regular  grammars.  Such  grammars  are  of  some 
interest, and they need not be developed in the "procedural" AI terminology
of  transition  networks  as  Woods  has  chosen  to  develop  them.  (See  Lalonde 
1977  for  a development  as  "regular  right-part  grammars"  leading  to  a theory 
and  an  implementation  which  appear  more  attractive  than  those  of  Woods.) 

Whereas  Woods  broke  up  grammars  "horizontally",  the  other  attempt 
was to break up grammars "vertically" into smaller pieces applied
sequentially; this was the "hierarchical sub-grammar" mechanism introduced in
the  Quince  system  of  the  Project  on  Linguistic  Analysis  (Wang  and  Chan  1975, 
Caskins  1973).  The  POLA  technique  required  human-separation  of  a large 
grammar  into  pieces,  with  the  non-terminal  symbols  of  some  grammars  serving 
as  terminals  of  other  grammars.  In  the  development  of  such  grammars  it 
was  thus  possible  to  separate  some  concerns,  because  in  applying  them  to 
parsing  it  was  possible  to  utilize  alternative  grammars  depending  on  the 
tree-tops  developed  by  a preceding  grammar  application. 

Both  of  these  ways  of  making  one  large  context-free  grammar  into 
several  smaller  ones  are  useful,  though  they  address  such  different  goals 
that  it  is  impossible  to  compare  their  relative  effectiveness  (and  they 
are  not  mutually  exclusive).  But  they  have  in  common  that  neither  really 
does  much  about  the  multiplicative  duplication  of  vocabulary  symbols  and 
rules  previously  described.  Hence,  it  has  been  necessary  to  augment  both 
schemes with additional extensions to control multiplicative duplication,
which would still be crippling if not contained.

3. A Model for Context-Free Grammars with Structured Vocabulary

Once  we  have  identified  the  problem  of  multiplicative  duplication 
of  vocabulary  and  rules  in  a context-free  grammar,  it  is  tolerably 
obvious  what  to  do  about  it:  we  must  introduce  a notational  system  which 
allows  the  process  of  duplication  to  be  implied,  rather  than  requiring 
that  it  be  carried  out  at  full  length.  In  fact,  this  step  is  so  obvious 
that it has been taken by almost everyone who ever wrote grammars to be
read by human beings, but the variety of different notions and notations
introduced has been so bewildering that the similarities have not generally
been  appreciated.  For  this  same  reason  the  device  has  not  been  developed  as 
well  as  it  could  be  for  use  in  descriptions  of  natural  languages. 

There is now an elegant and comprehensive notation available for
writing context-free grammars with systematically structured vocabularies
and productions, and this is the "two-level" grammar devised by Adriaan
van Wijngaarden for the definition of the programming language Algol 68.
The van Wijngaarden grammars (in some intuitive sense) cover and include
all the other proposals, and are the simplest technique available. This
descriptive system is unfortunately not widely known, despite (perhaps
because of) the publicity given to the language Algol 68.

In  this  section,  accordingly,  we  will  introduce  the  notion  of  a 
van  Wijngaarden  grammar.  We  will  then  relate  it  to  a number  of  descriptive 
techniques  used  by  linguists,  to  a number  of  extensions  of  context-free 
languages  used  by  computational  linguists,  and  also  to  related  formalisms 
used  by  computer  scientists  to  define  formal  languages  and  programming 
languages; in particular, to the attribute grammars of Donald Knuth and
the affix grammars of C. H. A. Koster.

It is not at all clear (a reader should know in advance) that
it would be wise to adopt the van Wijngaarden grammar format as an actual
encoding of rules to be used for computational analysis; but the van
Wijngaarden grammar abstracts the essential problem so cleanly that it
offers the indispensable framework of insight within which related formalisms
can be understood and compared with one another.


3.1 van Wijngaarden Grammars

The idea of a "two-level" grammar (or "van Wijngaarden grammar",
sometimes also W-grammar or vW-grammar) was originated by Adriaan van
Wijngaarden, for many years the director of the Mathematisch Centrum at
Amsterdam, as a method for describing the programming language Algol 68
then under development by an international committee (van Wijngaarden 1965,
van der Poel 1971). A van Wijngaarden description of Algol 68 appeared
in 1969 (van Wijngaarden et al. 1969) which failed to exploit the potential
of the method; a revised description appeared in 1975 (van Wijngaarden et
al. 1975), which for the first time showed to advantage the two-level
grammar idea, and interest in the method revived to some degree (see
Cleaveland and Uzgalis 1977).

Unfortunately for the descriptive method, the Algol 68 language has
not been well received (specimen reaction, attributed to P. Z. Ingerman:
"This language fills a much-needed gap"), and the widespread distaste for
the language Algol 68 has contributed to the unpopularity of the method
specially devised to describe it. Worse still, the Algol 68 Report
introduced a slightly different notation for a context-free grammar, and
this impeded discussion further. (The Algol 68 Report notation for a van
Wijngaarden grammar will not be used here, but a short specimen is included
at the end of section 3.2.) But the method of syntax description is really
elegant and important, and can be divorced from Algol 68. It should be
better known to computer scientists and linguists.

Let  us  begin  with  a couple  of  examples  of  van  Wijngaarden  grammars; 
after  the  examples  the  terminology  will  be  reviewed  at  length  and  made 
more precise. In section 2.3 above, we considered a grammar from Winograd
with rules such as

    1a. s -> np/sg vp/sg
    1b. s -> np/pl vp/pl
    2a. np/sg -> det noun/sg
    2b. np/pl -> det noun/pl

and so forth. (The use of lower-case letters to spell the symbols is a
change,  which  will  be  explained  presently,  but  obviously  this  is  the  same 
grammar  as  when  upper-case  letters  were  used.) 

We  could  abbreviate  these  rules  by  taking  advantage  of  the  structure 
which  is  in  the  vocabulary  of  symbols  — i.e.,  the  systematic  relations 
between  the  pairs  of  symbols  such  as  "np/pl"  and  "np/sg".  We  can  introduce 
a cover term for either "sg" or "pl", and write the cover symbol as "NUM"
(using upper-case letters for cover symbols, lower-case letters for the
others). Using this abbreviation, the rules become

    h1. <s> :-> <np/ NUM> <vp/ NUM>
    h2. <np/ NUM> :-> <det> <noun/ NUM>

and  so  forth.  (We  will  use  the  angle-brackets  to  surround  the  symbols 
in  the  grammar,  since  each  one  may  consist  of  more  than  one  piece  such  as 
"noun/"  and  "NUM",  and  the  brackets  aid  in  visual  parsing  of  the  rules  by 
a human  reader.)  We  must  now  understand  each  of  these  rules  as  a "rule 
schema",  a pattern  for  generating  the  rules  given  before  by  plugging  in 
various  values  for  "NUM".  In  order  to  make  this  idea  precise,  we  will 
specify  the  values  that  "NUM"  can  assume  by  a separate  context-free 
grammar,  a "meta-grammar": 

    m1. NUM ::-> sg | pl

It  is  very  important  also  to  stipulate  that  the  same  value  of  NUM  must  be 
inserted  into  each  occurrence  of  NUM  in  a rule  schema.  For  instance, 
there is not a rule such as "s -> np/sg vp/pl", because that could only
result from replacing NUM in rule h1 with 'sg' at its first occurrence,
but with 'pl' at its second occurrence. This principle we will refer to
as  the  Uniform  Replacement  Convention  (URC). 
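
As a minimal sketch of how rule schemata expand under the Uniform Replacement Convention, the following Python generates the plain rules from the two schemata. The names `META_VALUES`, `HYPER_RULES`, and `expand` are assumptions made for this illustration, and the values of each meta-symbol are simply listed rather than derived from a grammar.

```python
from itertools import product

# The meta-rule for NUM, written as a table of finished values (assumption:
# each meta-symbol has a small, finite list of values).
META_VALUES = {"NUM": ["sg", "pl"]}

# The two rule schemata: (left side, right side); a symbol is a string that
# may contain the name of a meta-symbol.
HYPER_RULES = [
    ("s", ["np/NUM", "vp/NUM"]),
    ("np/NUM", ["det", "noun/NUM"]),
]

def expand(hyper_rule, meta_values):
    """Yield the plain rules obtained from one schema under the URC."""
    left, right = hyper_rule
    symbols = [left] + right
    # Meta-symbols that actually occur somewhere in this schema.
    used = [m for m in meta_values if any(m in s for s in symbols)]
    for combo in product(*(meta_values[m] for m in used)):
        # One binding per meta-symbol for the whole rule: this is the URC.
        binding = dict(zip(used, combo))

        def fill(symbol):
            for name, value in binding.items():
                symbol = symbol.replace(name, value)
            return symbol

        yield fill(left), [fill(s) for s in right]

for rule in HYPER_RULES:
    for left, right in expand(rule, META_VALUES):
        print(left, "->", " ".join(right))
```

Because the binding is chosen once per rule, a mixed rule such as "s -> np/sg vp/pl" is never produced; choosing a value independently at each occurrence would generate it.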

These two sets of rules above constitute a "two-level" van Wijngaarden
grammar corresponding to the original context-free grammar (the four
rules from Winograd) given just previously. Those four original rules are
represented in the van Wijngaarden grammar by (a) rule schemata which
incorporate variables in context-free rules (like rules h1, h2 above),
and (b) a second context-free grammar whose terminal strings are the permitted
values of the variables (like rule m1 above). Terminal strings of this second
grammar are substituted for variables in the rules of the other grammar,
subject to the Uniform Replacement Convention.

The  name  used  for  the  variables  is  "meta-symbols".  The  second 
grammar,  naturally  enough,  is  called  the  "meta-grammar"  because  it  defines 
the meta-symbols. The first grammar, into which the substitutions are made,
is called the "hyper-grammar". Accordingly, we have three kinds of
rules:

(1) Meta-rules are context-free rules which define the possible
values of meta-symbols:

    m1. NUM ::-> sg | pl

(2) Hyper-rules are schemata for context-free rules, whose symbols may
contain meta-symbols:

    h1. <s> :-> <np/ NUM> <vp/ NUM>
    h2. <np/ NUM> :-> <det> <noun/ NUM>

These two sets of rules together make up a "two-level" van Wijngaarden
grammar. They define the language generated by their production rules:

(3) Production-rules are the context-free rules which can be produced
from the schematic hyper-rules by systematic replacement of meta-symbols
according to the Uniform Replacement Convention.

The  meta-rules  and  hyper-rules  of  the  van  Wijngaarden  grammar  above 
specify  the  four  production-rules: 


    p1. s -> np/sg vp/sg
    p2. s -> np/pl vp/pl
    p3. np/sg -> det noun/sg
    p4. np/pl -> det noun/pl

In  this  case,  as  very  frequently  happens,  the  two  sets  of  rules 
in  the  van  Wijngaarden  grammar  (the  meta-rules  and  the  hyper-rules) 
specify  a set  of  production-rules  which  could  perfectly  well  be  written 
down in full, as we have just done. But the van Wijngaarden format is
shorter (not much here, but often very much shorter), and more importantly
the van Wijngaarden grammar preserves the information that rules which
differ only in the number-agreement specified ("NUM") are really two
instances  of  the  same  rule  schema.  This  formal  generalization  corresponds 
to  a linguistic  claim  that  sentences  with  singular  subjects  do  not  have  a 
different  grammar  from  sentences  with  plural  subjects;  in  fact  the  grammar 
is  the  same,  but  there  must  be  number  agreement  between  the  verb  and  its 
subject.  Also,  in  the  van  Wijngaarden  format  the  rules  for  sentence 
formation  could  be  modified  by  changing  just  the  appropriate  single 
rule  schema  (hyper-rule),  leaving  the  definition  of  the  meta-variable  NUM 
in  the  meta-rules  unchanged. 

The  preservation  of  this  kind  of  structure  in  the  vocabulary  of 
symbols  is  a most  important  feature  of  van  Wijngaarden  grammars  for 
natural  languages,  and  it  amounts  to  more  than  just  a clever  abbreviation 
for  an  ordinary  context-free  grammar.  As  we  will  see  later,  a two-level 
grammar  can  define  languages  which  have  no  context-free  grammar,  and  van 
Wijngaarden  grammars  are  actually  equivalent  to  Chomsky  type-0  grammars, 
or  unrestricted  rewriting  systems. 

It  will  be  apparent  to  every  linguist  that  the  use  of  meta-symbols 
in  context-free  grammars  is  an  old  custom  in  linguistic  description  (and 
we will examine some of those older uses below); the distinctive
contributions of the van Wijngaarden grammar are some of the techniques for
exploiting such meta-symbols, and the explicit use of (what else?) a second
context-free  grammar  to  derive  the  meta-symbols.  The  whole  arrangement 
seems  extremely  obvious,  but  it  turns  out  to  have  some  very  non-obvious 
properties. 

3.2 Notation for van Wijngaarden Grammars

Since  we  now  have  three  kinds  of  rules,  three  kinds  of  symbols, 
and  so  forth,  it  is  best  to  have  a very  clear  notation  for  keeping  them 
separate. This section introduces the full nomenclature in a step-by-step
fashion, and at the end of the section there is a reference summary
of  the  notation  which  can  be  consulted  while  reading  the  remainder  of  this 
chapter. 

We will use upper-case letters for meta-symbols, and lower-case
letters for symbols such as the ones which can be derived from
meta-symbols (we call these lower-case symbols proto-symbols, but their name
seldom comes up). The grammar of the meta-symbols is the meta-grammar.
Each meta-rule in such a meta-grammar will take the form of expanding one
single meta-symbol on the left side of a meta-rule into a string of
meta-symbols and proto-symbols. Thus, the meta-symbols are the non-terminals
of the meta-grammar and the proto-symbols are its terminals; that is why
meta-symbols are upper-case and proto-symbols are lower-case, as is
customary in an ordinary context-free grammar written according to the
usual Chomsky conventions. For example, a meta-grammar of three
meta-rules is the following:

    SUBJ ::-> ANIMATE
    ANIMATE ::-> minus-animate | plus-animate HUMAN
    HUMAN ::-> minus-human | plus-human

where SUBJ, ANIMATE, and HUMAN are the non-terminals (meta-symbols), and
minus-animate, plus-animate, minus-human, and plus-human are the terminals
(proto-symbols). (There is really no requirement to spell out "plus" or
"minus", but the style customary in writing these grammars is to use long
names for symbols, which is probably a bad style.)

In the meta-rules, the symbol ::-> is used for the "re-write"
symbol; a different re-write symbol is used in each type of grammar so
that a single rule in isolation can always have its type identified. Any
rule with the double-colon arrow is necessarily a meta-rule. The usual
vertical bar is used to separate alternatives on the right side of rules,
and the same bar is used in all grammars. Alternatives can always be
written as additional rules at the option of the grammar writer, so the
three meta-rules above are equivalent to the five (unabbreviated)
meta-rules:

    SUBJ ::-> ANIMATE
    ANIMATE ::-> minus-animate
    ANIMATE ::-> plus-animate HUMAN
    HUMAN ::-> minus-human
    HUMAN ::-> plus-human

It is necessary to stress that these meta-grammars are to be interpreted
as  utterly-ordinary  context-free  grammars.  The  only  feature  which  is 
slightly  unusual  is  that  you  are  permitted  to  choose  any  non-terminal  as 
the  start-symbol  for  the  grammar,  and  then  the  values  of  that  non-terminal 
are the strings it derives in the meta-grammar. For example, in this
meta-grammar if SUBJ is chosen as the start symbol, it gives three possible
strings  of  terminal  proto-symbols: 

    SUBJ derives: minus-animate
                  plus-animate minus-human
                  plus-animate plus-human

So,  in  the  rule  schemata,  wherever  SUBJ  appears  it  could  be  replaced  with 
any  of  these  three  strings.  But  if  HUMAN  is  treated  as  the  start  symbol 
of  the  meta-rules,  then  HUMAN  only  leads  to  two  strings  of  terminal 
proto-symbols:

    HUMAN derives: minus-human
                   plus-human

and  so  where  HUMAN  is  used  in  the  rule  schemata  it  could  be  replaced  only 
with  one  of  these  two  strings. 
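
The idea that any meta-symbol may serve as a start symbol can be sketched in a few lines of Python. The dictionary layout and the function name `derive` are assumptions made for this illustration, and the sketch relies on the meta-grammar being non-recursive, so that each meta-symbol derives only finitely many strings.

```python
from itertools import product

# The three meta-rules of the text; upper-case names are meta-symbols,
# everything else is a proto-symbol. Each value lists the alternatives.
META_RULES = {
    "SUBJ": [["ANIMATE"]],
    "ANIMATE": [["minus-animate"], ["plus-animate", "HUMAN"]],
    "HUMAN": [["minus-human"], ["plus-human"]],
}

def derive(symbol, rules):
    """Return all strings of proto-symbols derivable from `symbol`.

    Assumes the meta-grammar is non-recursive, so the result is finite.
    """
    if symbol not in rules:          # a proto-symbol derives only itself
        return [[symbol]]
    results = []
    for alternative in rules[symbol]:
        # Derive each symbol of the alternative, then combine the choices.
        parts = [derive(s, rules) for s in alternative]
        for choice in product(*parts):
            results.append([p for piece in choice for p in piece])
    return results

for start in ("SUBJ", "HUMAN"):
    for string in derive(start, META_RULES):
        print(start, "derives:", " ".join(string))
```

Calling `derive` with SUBJ gives the three strings listed above, and calling it with HUMAN gives only two, which is exactly the difference between using SUBJ and using HUMAN in a rule schema.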

Moving now to the other component of a van Wijngaarden grammar, we
will call the grammar of the rule schemata the hyper-grammar. The symbols
used in the hyper-grammar are hyper-symbols, which are strings of
proto-symbols (little letters) and meta-symbols (big letters) enclosed in angle
brackets. For example, <np SUBJ> is a single hyper-symbol, which contains
within its angle brackets the proto-symbol 'np' and the meta-symbol
'SUBJ'. Each rule schema is called a hyper-rule, and takes the form of
a context-free rule rewriting a single hyper-symbol on the left side as
a string of hyper-symbols. For example, three hyper-rules are:

    <s> :-> <np SUBJ> <SUBJ vp>
    <np SUBJ> :-> <det terminal> <noun SUBJ terminal>
    <SUBJ vp> :-> <verb SUBJ terminal>

Each has one hyper-symbol on the left side, and a string of
hyper-symbols on the right side. In hyper-rules the rewrite symbol is :->,
so any rule with a single-colon arrow is a hyper-rule. Hyper-alternatives
are separated by a vertical bar, just as with meta-alternatives. (There
are no hyper-alternatives in the rules above.) We have already used up
the distinction between upper-case non-terminals and lower-case terminals
in the meta-grammar, and indeed both appear intermixed in the
hyper-symbols. We need a new convention to express that distinction in
hyper-grammars, so by convention all terminals in hyper-grammars will end with
the proto-symbol "terminal". (In the hyper-grammar above, both the
second and third rules expand to strings of terminals only.) Some
additional mechanism is then required, such as a lexicon, to associate
each terminal symbol with its representation, which is ordinary
computational practice anyway. We will not be concerned here with the
representation of terminals.

The three hyper-rules make use of the meta-symbol SUBJ which (as
we saw before) derives three terminal strings in the meta-grammar; since
any of these may be substituted (observing the Uniform Replacement
Convention) in each hyper-rule, that makes the hyper-rules short for nine
rules in total:

    <s> -> <np minus-animate> <minus-animate vp>
    <s> -> <np plus-animate minus-human>
           <plus-animate minus-human vp>
    <s> -> <np plus-animate plus-human>
           <plus-animate plus-human vp>
    <np minus-animate> -> <det terminal> <noun minus-animate terminal>
    <np plus-animate minus-human> -> <det terminal>
           <noun plus-animate minus-human terminal>
    <np plus-animate plus-human> -> <det terminal>
           <noun plus-animate plus-human terminal>
    <minus-animate vp> -> <verb minus-animate terminal>
    <plus-animate minus-human vp> -> <verb plus-animate minus-human terminal>
    <plus-animate plus-human vp> -> <verb plus-animate plus-human terminal>

Rules such as these, generated from the hyper-rule schemata by replacing
meta-symbols, we will call production-rules. Such production-rules
have the plain arrow as the rewrite symbol (their mark as regular
context-free grammar rules), and they rewrite a single production-symbol on the
left side as a string of production-symbols. The production-symbols are
simply the concatenation of the proto-symbols within a pair of angle
brackets after substitution has taken place; the brackets are ordinarily
retained for ease in reading. Observe carefully that the separation
between distinct proto-symbols is no longer present after replacement of
meta-symbols has taken place; a production-symbol such as "s" and a
production-symbol such as "npplus-animateminus-human" are equally thought
of as just single, unanalyzable symbols, exactly as they would be in
an ordinary context-free grammar.
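
The passage from hyper-rules to production-rules, including the concatenation of proto-symbols into atomic production-symbols, can be sketched as follows. This is an illustration under stated assumptions, not an implementation from any existing system: the three values of SUBJ are copied from the text rather than derived, and the names `SUBJ_VALUES`, `HYPER_RULES`, and `production_symbol` are invented for the example.

```python
# The three values of SUBJ derived in the meta-grammar (copied from the text).
SUBJ_VALUES = [
    ["minus-animate"],
    ["plus-animate", "minus-human"],
    ["plus-animate", "plus-human"],
]

# The three hyper-rules; each hyper-symbol is a list of proto-symbols and
# meta-symbols, standing for the contents of one pair of angle brackets.
HYPER_RULES = [
    (["s"], [["np", "SUBJ"], ["SUBJ", "vp"]]),
    (["np", "SUBJ"], [["det", "terminal"], ["noun", "SUBJ", "terminal"]]),
    (["SUBJ", "vp"], [["verb", "SUBJ", "terminal"]]),
]

def production_symbol(hyper_symbol, subj_value):
    """Substitute SUBJ, then concatenate the pieces into one atomic symbol."""
    pieces = []
    for piece in hyper_symbol:
        pieces.extend(subj_value if piece == "SUBJ" else [piece])
    return "".join(pieces)

production_rules = [
    (production_symbol(left, value),
     [production_symbol(h, value) for h in right])
    for left, right in HYPER_RULES
    for value in SUBJ_VALUES   # the same value throughout one rule (URC)
]

for left, right in production_rules:
    print(left, "->", " ".join(right))
print(len(production_rules), "production rules")
```

The printed symbols, such as "npplus-animateminus-human", carry no internal boundaries; any structure a reader sees in them comes from the spelling alone, as with the symbols of an ordinary context-free grammar.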

For  reference,  we  now  insert  a summary  of  this  section: 

A van  Wijngaarden  grammar  (vW-grammar)  consists  of  two  components: 

(1)  a meta-grammar,  and  (2)  a hyper-grammar. 

The  purpose  of  the  meta-grammar  is  to  define  a structured  vocabulary 
of  symbols.  It  takes  the  form  of  a context-free  grammar  in  which  the 
non-terminals  are  meta-symbols  (written  in  UPPER  CASE  LETTERS),  and  the 
terminals are proto-symbols (written in lower case letters). Each
meta-rule consists of re-writing a single meta-symbol as a string of
meta-symbols and proto-symbols. The re-write symbol used in meta-rules is ::->.
The symbol used in meta-rules to separate meta-alternatives is |.

Example  meta-grammar  component  of  a vW-grammar: 

    SUBJ ::-> ANIMATE
    ANIMATE ::-> minus-animate | plus-animate HUMAN
    HUMAN ::-> minus-human | plus-human

The purpose of the hyper-grammar is to serve as a set of rule
schemata for the rules defining the language of the vW-grammar. It takes
the form of a context-free grammar in which the non-terminals are
hyper-symbols (strings of proto-symbols and meta-symbols enclosed in angle
brackets < >) and the terminals are hyper-symbols ending with the
proto-symbol 'terminal'. Each hyper-rule consists of re-writing a single
hyper-symbol as a string of hyper-symbols. The re-write symbol used in
hyper-rules is :->. The symbol used in hyper-rules to separate
hyper-alternatives is |.

Example hyper-grammar component of a vW-grammar:

<s> :→ <np SUBJ> <SUBJ vp>

<np SUBJ> :→ <det terminal> <noun SUBJ terminal>
<SUBJ vp> :→ <verb SUBJ terminal>


In a vW-grammar, the hyper-rules specify a grammar (the
production-grammar) which is obtained by replacing in the hyper-rules all
the meta-symbols with strings of proto-symbols which can be derived in the
meta-grammar by treating the meta-symbol as the start-symbol. (Thus, the
meta-grammar may actually be made up of several sets of meta-rules which
do not interact.) This must be done in accordance with the Uniform
Replacement Convention (URC), which says that all instances of the same
meta-symbol in a single hyper-rule must be replaced with the same string
of proto-symbols.
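
The replacement procedure just described can be sketched in a few lines
of Python. This is only an illustration under stated assumptions, not part
of the QUINCE software: the names META, HYPER, expansions and
production_rules are hypothetical, the arrow is printed as "->", and the
sketch works only when every meta-symbol derives a finite set of strings
(as SUBJ does in the example grammar).

```python
from itertools import product

# Meta-grammar of the example: each meta-symbol maps to its alternatives.
META = {
    "SUBJ": [["ANIMATE"]],
    "ANIMATE": [["minus-animate"], ["plus-animate", "HUMAN"]],
    "HUMAN": [["minus-human"], ["plus-human"]],
}

# Hyper-rules as (left hyper-symbol, [right hyper-symbols]); a hyper-symbol
# is a tuple of proto-symbols (lower case) and meta-symbols (upper case).
HYPER = [
    (("s",), [("np", "SUBJ"), ("SUBJ", "vp")]),
    (("np", "SUBJ"), [("det", "terminal"), ("noun", "SUBJ", "terminal")]),
    (("SUBJ", "vp"), [("verb", "SUBJ", "terminal")]),
]


def expansions(symbol):
    """All proto-symbol strings derivable from one symbol (finite case only)."""
    if symbol not in META:
        return [[symbol]]          # a proto-symbol stands for itself
    results = []
    for alternative in META[symbol]:
        parts = [expansions(s) for s in alternative]
        for combo in product(*parts):
            results.append([p for part in combo for p in part])
    return results


def production_rules(hyper_rules):
    """Apply the Uniform Replacement Convention to every hyper-rule."""
    rules = []
    for left, right in hyper_rules:
        metas = sorted({s for hs in [left, *right] for s in hs if s in META})
        # One production-rule per consistent choice of values: the same
        # meta-symbol receives the same string everywhere in the rule.
        for values in product(*(expansions(m) for m in metas)):
            binding = dict(zip(metas, values))

            def subst(hs):
                return " ".join(p for s in hs for p in binding.get(s, [s]))

            rules.append((subst(left), [subst(hs) for hs in right]))
    return rules


RULES = production_rules(HYPER)
for lhs, rhs in RULES:
    print(f"<{lhs}> -> " + " ".join(f"<{r}>" for r in rhs))
```

Each of the three hyper-rules contains only the meta-symbol SUBJ, which
has three possible values, so the sketch yields nine production-rules.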


The purpose of the production-grammar obtained in this way is to
specify the language of the vW-grammar. It takes the form of a
context-free grammar (but possibly with an infinite number of rules) in
which the non-terminals are concatenations of proto-symbols (strings of
lower case letters), and the terminals are concatenations of proto-symbols
ending in 'terminal'. (Some further mechanism, such as a lexicon, is then
used to give the representation of each terminal symbol.) Each
production-rule consists of re-writing a single production-symbol as a
string of production-symbols. The re-write symbol used in production-rules
is → .

The symbol used in production-rules to separate production-alternatives
is | . Angle brackets may optionally be retained around production-
symbols to aid readability.


Example production-grammar (specified by the vW-grammar above):

<s> → <np minus-animate> <minus-animate vp>
<s> → <np plus-animate minus-human> <plus-animate minus-human vp>
<s> → <np plus-animate plus-human> <plus-animate plus-human vp>

<np minus-animate> → <det terminal> <noun minus-animate terminal>
<np plus-animate minus-human> → <det terminal>
                                <noun plus-animate minus-human terminal>
<np plus-animate plus-human> → <det terminal>
                               <noun plus-animate plus-human terminal>

<minus-animate vp> → <verb minus-animate terminal>
<plus-animate minus-human vp> → <verb plus-animate minus-human terminal>
<plus-animate plus-human vp> → <verb plus-animate plus-human terminal>

The language generated by this production-grammar (and thus by
this vW-grammar) consists of three strings of terminals, all with
derivations essentially alike except for agreement of selectional
restrictions:

[Derivation tree rooted in <s>; the diagram is not recoverable from the
source.]


(Important note. The terminology and notation introduced in this
section (and used in all succeeding sections) were devised especially
for this presentation, and do not correspond to those in the Algol 68
Report or Revised Report (van Wijngaarden et al. 1969, 1975); the present
notation more resembles that used in various other, independent, studies
(Baker 1972, Deussen 1975, Greibach 1974). But since the Algol 68
Report is the only sizeable example of van Wijngaarden grammar, it is
nice to be able to read it; the recent introduction to van Wijngaarden
grammars by Cleaveland and Uzgalis (1977) is an excellent practice field
for the Algol 68 Report, and the present notation was chosen with an eye
to making the transition as easy as possible.

(The Algol 68 Report, for example, uses the word "notion" in most
of the places where "symbol" is used here; so a grammar consists of
meta-notions, hyper-notions, proto-notions, and so forth. This use of
"notion" is counter-intuitive. The notation of the Algol 68 Report differs
in being less redundant; for instance, the sample van Wijngaarden grammar
would be written:

(meta-rules:)

SUBJ:: ANIMATE.

ANIMATE:: minus-animate; plus-animate HUMAN.

HUMAN:: minus-human; plus-human.

(hyper-rules:)

s: np SUBJ, SUBJ vp.

np SUBJ: det terminal, noun SUBJ terminal.

SUBJ vp: verb SUBJ terminal.

(No arrows, final periods, alternates separated by semicolons,
hyper-symbols set apart by commas.)

(This notation is admirably suited to typewriters, and it is a
pity that it is so hard to read (much harder, of course, in larger and
more complicated grammars). But ten years of bitter experience have
made it clear that for some reason people do not immediately find the
Algol 68 Report notation helpful.)

3.3 Chomsky's X̄ Convention as a van Wijngaarden Grammar

As another example of the notation and the motivation for grammars
with structured vocabulary, we might consider the "X̄ Convention" as
introduced by Chomsky (1970). (This presentation is now completely
outdated, but it will serve as an example here since it is likely to be
better known than more recent work in X̄ syntax.) The X̄ convention is
introduced as part of a general reformulation of the base component so
that instead of unanalyzable non-terminal symbols, each non-terminal
node will be characterized by a complex symbol. (This is in itself a
way of introducing a structured vocabulary into a context-free grammar,
a point to which we will return eventually.) Jackendoff (1974) has
summarized the X̄ convention in this way: "The general nature of the
claims made by the X̄ convention is now clear. The structural schema
(28) (below), in which X represents any lexical category, is claimed to
constitute a linguistically significant generalization of the structures
associated with the major categories.

[Schema (28): tree diagram not recoverable from the source; only the node
label "Comp X" survives.]

That is, we expect there to exist rules whose structural descriptions
refer to a range of structures including more than one value of X."

The rules proposed to be included in the base component by Chomsky
will employ "a variable standing for the lexical categories N, A, V"
which is called X. "Then the base rules introducing N, A, and V will
be replaced by a schema." Eventually Chomsky arrives at rules which
are summarized by Jackendoff as:

X̿ → [Spec, X̄] - X̄
X̄ → X - comp

where 'comp' is an abbreviation for some sequence of nodes, but is not
itself a constituent. Sample comps are "NP, S, NP S, NP PrepP, PrepP
PrepP, etc." (Chomsky 1970), and in the X̄ schema above the "full range
of structures that serve as complements" should appear.

As presented by Chomsky, this entails a complete redefinition of
the base component in ways which are not made fully explicit. We can,
however, identify at least four different devices at work: (1) use of X
as a cover term for N, A, or V in rule schemata; (2) use of X, X̄, X̿ to
indicate systematic relatedness of categories; (3) use of [Spec, X̄] as
a complex symbol which is analyzed differently depending on the value of
X: [Spec, N̄] as the determiner, [Spec, V̄] as the auxiliary, and [Spec, Ā]
perhaps as the system of qualifying elements associated with adjective
phrases. "Analyzed" has here a technical meaning, namely that there
are rules

[Spec, N̄] → Det
[Spec, V̄] → Aux

and so forth; (4) use of 'comp' (Chomsky uses an ellipsis ... instead)
to indicate expansions into various node sequences, although 'comp' is
not a constituent.

Most of these devices can be captured adequately in a van Wijngaarden
grammar, and so it will be an instructive exercise to recast the two
rule schemata of Chomsky, and all the accompanying understandings about
how they are to be interpreted, into van Wijngaarden form.

(1)  It  is  easy  to  find  a way  to  let  X represent  the  category 
symbols  N,  A,  or  V in  a rule,  since  that  is  a chief  use  of  meta-variables. 
All  we  need  is  a meta-rule  defining  X: 

X ::→ n | a | v

(2) It is also easy to capture the relatedness of the categories
X, X̄, X̿ by defining a meta-symbol for bar and double bar (call them BAR
and DOUBLE), and then using hyper-symbols in the hyper-rules which are
composed of the meta-symbol for X and the meta-symbol for one of the
bars (such as <X>, <X BAR>, <X DOUBLE>), so that each hyper-symbol
contains a use of the same meta-variable, X. The separate question of
how the categories so related are rewritten in terms of one another is
correctly captured in the Chomsky schemata, and so it transfers naturally
to the hyper-rules, which are also schemata. The first hyper-rule, then,
will be:

<X DOUBLE> :→ <spec X BAR> <X BAR>

(3) Having introduced such production-symbols as <spec n bar> or
<spec v bar> under the covering hyper-symbol of <spec X BAR>, it is
perfectly easy to distinguish among them and to expand them differently
through additional hyper-rules:

<spec n bar> :→ <det>

<spec v bar> :→ <aux> <OPTIONALADV>

<spec a bar> :→ <qual>

(4) The remaining question of how to write 'comp' and yet avoid
having it be a constituent is the most difficult. Chomsky used the
wild-card symbol of ellipsis because rule schemata do not provide a way to
summarize rules with unboundedly different numbers of symbols on the
right-hand side, and neither do van Wijngaarden hyper-rules.

It is possible, however, that the content of 'comp' should not
have a recursive definition with the structure hidden, but rather should
be defined in terms of a finite number of "slots", each of which can be
"filled" by various nodes or in some cases by nothing; this style of
description has often been used by linguists. For exposition, we will
assume that a 'comp' is an optional NP or PP, followed by an NP or S.

This  will  provide  an  exhibition  of  one  way  in  which  optional  elements 
can  be  introduced  into  van  Wijngaarden  grammars. 

The rule which we wish to write would be represented in a common
notation for abbreviating context-free rules as:

X̄ → X ( { NP | PP } ) { NP | S }


This abbreviates six rules, namely those with right-hand sides

X NP NP
X PP NP
X NP
X NP S
X PP S
X S
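
As a small check on the count, the six right-hand sides can be enumerated
mechanically. The sketch below is an illustration only; the names
OPTIONAL_SLOT, FINAL_SLOT and right_hand_sides are hypothetical, and None
is used here to mark an optional slot that is left unfilled.

```python
from itertools import product

# Slot fillers; None marks the optional slot left unfilled.
OPTIONAL_SLOT = ["NP", "PP", None]   # an optional NP or PP
FINAL_SLOT = ["NP", "S"]             # followed by an NP or S


def right_hand_sides(head="X"):
    """Enumerate every right-hand side abbreviated by the slot notation."""
    sides = []
    for first, second in product(OPTIONAL_SLOT, FINAL_SLOT):
        symbols = [head] + [s for s in (first, second) if s is not None]
        sides.append(" ".join(symbols))
    return sides


RHS = right_hand_sides()
for side in RHS:
    print(side)
```

Three choices for the first slot times two for the second give the six
rules listed above, though not necessarily in the same order.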


We will now explain an important additional convention which is
needed in interpreting van Wijngaarden grammars for optional elements
and also for other purposes, and then we will apply that information
to writing the rule above.

In a van Wijngaarden grammar, we can make use of optional elements
by introducing a new meta-symbol and meta-rule:

EMPTY ::→ λ

(EMPTY has as its expansion the null string, represented by the
lower-case lambda [λ] rather than the usual upper-case [Λ] to preserve
conventions.)

We can then write meta-rules utilizing EMPTY as one of the
alternatives, for example:

OPTIONALADV ::→ adv | EMPTY

and hyper-rules to use such meta-symbols, such as the one from the last
sub-section:

<spec v bar> :→ <aux> <OPTIONALADV>

Such a hyper-rule will give rise to sub-trees in structural descriptions
such as

      <spec v bar>
       /        \
    <aux>      <adv>

and

      <spec v bar>
       /        \
    <aux>     <EMPTY>

(We will customarily show EMPTY in such trees rather than its terminal
empty string in the meta-grammar, for ease of reading.)

The important convention is that we interpret a structural
description such as the second one above to be exactly the same as another
tree in which the EMPTY and all nodes which dominate EMPTY exclusively
have been pruned away; here, we identify the second tree above with the
pruned tree

      <spec v bar>
           |
         <aux>

We will have more important uses for this convention in following sections,
since it will permit hyper-symbols to be used as "predicates", which
increases the naturalness of van Wijngaarden grammars.
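
The pruning convention can be stated as a short recursive procedure. The
following Python sketch is an illustration under assumed conventions, not
a description of any existing program: a tree is taken to be a
(label, children) pair, and the function name prune is hypothetical.

```python
def prune(tree):
    """Remove EMPTY and every node which dominates EMPTY exclusively.

    A tree is a (label, children) pair; a leaf has an empty child list.
    Returns None when the whole subtree dominates only EMPTY.
    """
    label, children = tree
    if label == "EMPTY":
        return None
    if not children:
        return (label, [])          # an ordinary leaf is kept as it is
    kept = [p for p in (prune(child) for child in children) if p is not None]
    if not kept:
        return None                 # this node dominated EMPTY exclusively
    return (label, kept)


full = ("spec v bar", [("aux", []), ("EMPTY", [])])
pruned = prune(full)
print(pruned)
```

Applied to the second tree above, the sketch returns the pruned tree with
the single daughter <aux>.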

Using this device for optionality, then, we can now write rules
for the COMP elements:

OPTIONALCOMPA ::→ np | pp | EMPTY

COMPB ::→ np | s
EMPTY ::→ λ

<X bar> :→ <X> <OPTIONALCOMPA> <COMPB>

Here  the  COMP  meta-symbols  can  be  considered  as  the  names  of  "slots" 
and  their  terminal  meta-expansions  as  "fillers"  for  the  slots;  the  slot 
names  play  no  role  in  trees  generated  by  the  hyper-rule  above,  because 
they  are  replaced  with  actual  node  names  in  production-rules.  (This 
interpretation  is  capable  of  further  historical  and  practical  development, 
since  linguists  have  argued  for  years  about  how  to  incorporate  such 
notions  into  context-free  grammars.) 

We  have  now  completed,  piecemeal,  the  construction  of  van  Wijngaarden 
rules  to  describe  Chomsky’s  two  schemata  and  quite  a number  of  attendant 
informal  understandings.  The  result  looks  like  this. 

van Wijngaarden grammar for the X̄ convention

Meta-rules:

m1. X ::→ n | a | v
m2. BAR ::→ bar
m3. DOUBLE ::→ BAR BAR
m4. OPTIONALCOMPA ::→ np | pp | EMPTY
m5. COMPB ::→ s | np
m6. OPTIONALADV ::→ adv | EMPTY
m7. EMPTY ::→ λ

Hyper-rules:

h1. <X DOUBLE> :→ <spec X BAR> <X BAR>
h2. <X BAR> :→ <X> <OPTIONALCOMPA> <COMPB>
h3. <spec n bar> :→ <det>
h4. <spec v bar> :→ <aux> <OPTIONALADV>
h5. <spec a bar> :→ <qual>

The production rules which can be derived from the hyper-rules (via the
Uniform Replacement Convention) are:

(from h1:)  <n bar bar> → <spec n bar> <n bar>
            <a bar bar> → <spec a bar> <a bar>
            <v bar bar> → <spec v bar> <v bar>

(from h2:)  <n bar> → <n> <np> <np>
            <n bar> → <n> <pp> <np>
            <n bar> → <n> <np>
            <n bar> → <n> <np> <s>
            <n bar> → <n> <pp> <s>
            <n bar> → <n> <s>
            <a bar> → <a> <np> <np>
            <a bar> → <a> <pp> <np>
            <a bar> → <a> <np>
            <a bar> → <a> <np> <s>
            <a bar> → <a> <pp> <s>
            <a bar> → <a> <s>
            <v bar> → <v> <np> <np>
            <v bar> → <v> <pp> <np>
            <v bar> → <v> <np>
            <v bar> → <v> <np> <s>
            <v bar> → <v> <pp> <s>
            <v bar> → <v> <s>

(from h3:)  <spec n bar> → <det>

(from h4:)  <spec v bar> → <aux> <adv>
            <spec v bar> → <aux>

(from h5:)  <spec a bar> → <qual>

It is indeed clear that several linguistically significant
generalizations are not asserted in these production-rules, but they are
asserted in the rules of the van Wijngaarden grammar (whether or not these
generalizations are correct is not the point here; for further information
about the X̄ convention see Halitsky 1975). It is because these
linguistically significant generalizations are captured that the van
Wijngaarden rules are also much easier for a human being to understand.


3.4 van Wijngaarden Grammars for Non-Context-Free Languages

With the motivation provided by the foregoing linguistic examples,
we will now introduce some brief and schematic examples of van Wijngaarden
grammars for artificial languages so as to give a better idea of their
possibilities. These examples are modeled after some in the literature
(de Chastellier and Colmerauer 1969, Cleaveland and Uzgalis 1977), and
illustrate techniques which are common in van Wijngaarden grammar writing
but which are not readily apparent to one accustomed to single-level
grammars.


Our first example will be a grammar for the set of strings
{ aⁿ bⁿ | n ≥ 1 }, that is, the language of all strings of a's
followed by an equal number of b's, or ab, aabb, aaabbb, ...

The usual context-free grammar for this language is given as

S → a S b | a b

which produces derivations such as

[Derivation tree; the diagram is not recoverable from the source.]

There is, of course, no reason not to write such a grammar directly as a
van Wijngaarden grammar with a null meta-grammar component; hence,

<s> :→ <a terminal> <s> <b terminal> |
       <a terminal> <b terminal>

In either notation, the number of a's and b's is controlled by
how many times the first alternative of the rule is used, and the
equality constraint is enforced because each rule application introduces
exactly one terminal a and also exactly one terminal b.

But there is another way of getting the same language from a van
Wijngaarden grammar, which is less straightforward but which generalizes
as the grammar above does not. An alternative van Wijngaarden grammar
for { aⁿ bⁿ | n ≥ 1 } is

Meta-rules:

m1. N ::→ n | N n
m2. AB ::→ a | b

Hyper-rules:

h1. <s> :→ <N a> <N b>

h2. <n N AB> :→ <AB terminal> <N AB>

h3. <n AB> :→ <AB terminal>

This will take a bit of study, but the idea is basically simple and is
useful in many van Wijngaarden grammars. Notice first that meta-rule 1
(m1 above) specifies recursively any number of n's: meta-symbol N
derives as terminal strings of proto-symbols n, nn, nnn, nnnn, nnnnn,
and so on. Thus, meta-symbol N used as a start symbol in the meta-grammar
yields an infinite set of values; and that in turn means that when N
is used in a hyper-rule such as h1 above, the set of production-rules
which can be made from the hyper-rule schema is also infinite,
one rule for every possible value of N. The first few production rules
manufactured from hyper-rule h1 above would be:

<s> → <na> <nb>

<s> → <nna> <nnb>

<s> → <nnna> <nnnb>

<s> → <nnnna> <nnnnb>

<s> → <nnnnna> <nnnnnb>

  .
  .

and so on. This immediately changes the theory of the rules that we
are accustomed to, because now the language of the van Wijngaarden
grammar is specified by an infinite number of production-rules (whereas
it is a basic requirement of ordinary context-free grammars that the
set of rules should be finite).

The second hyper-rule also uses the meta-symbol N in an essential
way:

h2. <n N AB> :→ <AB terminal> <N AB>

This again abbreviates an infinite number of production rules, which
provide (in general) that a symbol composed of a certain number of n's
plus an a or b can be rewritten as an a-or-b-terminal followed by a
symbol containing one fewer n than the left side of the rule. (The
meta-symbol AB in the rule is just a cover term for a or b.) It is
the Uniform Replacement Convention which makes this true; the URC
assures that the same value for N (and for AB) is inserted into both
sides of the rule; since there is an extra explicit 'n' on the left, the
production-rules generated by this hyper-rule will be of the form:

<nna> → <a terminal> <na>

<nnb> → <b terminal> <nb>

<nnna> → <a terminal> <nna>

<nnnb> → <b terminal> <nnb>

<nnnna> → <a terminal> <nnna>

<nnnnb> → <b terminal> <nnnb>

and so on for all the values of N. Each such rule in effect casts off
a terminal and leaves a new non-terminal which can be re-written by a
previous rule. The final hyper-rule, h3,

h3. <n AB> :→ <AB terminal>

simply provides for handling the final case, the 'shortest' non-terminals
produced by rules of hyper-rule h2.
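
The way these production-rules cast off one terminal at a time can be
traced with a short Python sketch. It is offered only as an illustration:
the function names rewrite and derive are hypothetical, production-symbols
are represented as plain strings such as 'nna', and the infinite rule set
is never built but is computed on demand from the shape of the symbol.

```python
def rewrite(symbol):
    """Apply the production-rule for one production-symbol such as 'nnb'.

    Mirrors h2 and h3: cast off one terminal and drop one 'n'.
    Returns (terminal_letter, remaining_symbol_or_None).
    """
    count, letter = len(symbol) - 1, symbol[-1]
    if count < 1 or set(symbol[:-1]) != {"n"} or letter not in ("a", "b"):
        raise ValueError(f"no production-rule rewrites <{symbol}>")
    if count == 1:
        return letter, None        # h3: <n AB> -> <AB terminal>
    return letter, symbol[1:]      # h2: <n N AB> -> <AB terminal> <N AB>


def derive(n):
    """Terminal string obtained when N is replaced by n copies of 'n' in h1."""
    if n < 1:
        raise ValueError("N derives at least one n")
    output = []
    # h1: <s> -> <N a> <N b>, with the same value of N in both symbols (URC).
    for symbol in ("n" * n + "a", "n" * n + "b"):
        while symbol is not None:
            letter, symbol = rewrite(symbol)
            output.append(letter)
    return "".join(output)


print(derive(3))
```

The only free choice in a derivation is the value given to N in h1; every
later step is forced.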

So a sample derivation in this van Wijngaarden grammar will look like
the following (each node's daughters are indented beneath it):

<s>
    <nnna>
        <a terminal>
        <nna>
            <a terminal>
            <na>
                <a terminal>
    <nnnb>
        <b terminal>
        <nnb>
            <b terminal>
            <nb>
                <b terminal>

In this grammar the number of a's and b's is determined in an entirely
different way from the context-free grammar: here the number of a's
and b's depends not on how many times a recursive rule is applied
to rewrite S → a S b, but rather on which of the infinite number of
expansions of <s> is initially chosen and applied once. The equality
constraint is likewise enforced in an entirely different way: here the
equality depends not on the fact that one a and one b are added on
each recursive expansion, but rather is enforced by the Uniform Replacement
Convention, which assures agreement between the two non-terminals
introduced by the first rule application ("agreement" here meaning that
they initiate parallel chains of rules in the grammar to generate the same
number of terminals).

There would be no point in going through all of this for the
{ aⁿ bⁿ | n ≥ 1 } language, since that language has a simpler
context-free grammar. But the approach of the second van Wijngaarden
grammar generalizes in a way that the approach of the first one does not.
If we now wish to have a language { aⁿ bⁿ cⁿ | n ≥ 1 } (that is, any
number of a's, followed by the same number of b's, and again an equal
number of c's, or abc, aabbcc, aaabbbccc, aaaabbbbcccc, etc.), the
simpler scheme breaks down. This new language is one which is well-
known to be not context-free, meaning that it is not generated by any
context-free grammar. (It is not hard to see why this is so; there
are only two sides to a center-embedded symbol, so a context-free grammar
can coordinate only two strings at a time, and those must be mirror-
images of each other.)

But there is a van Wijngaarden grammar for the language
{ aⁿ bⁿ cⁿ | n ≥ 1 }, and it is only trivially different from the grammar
for the preceding one. The only difference we need to introduce is to make
a new meta-symbol ABC to be a cover symbol for a, or b, or c, and we also
need to introduce the three items in the first hyper-rule instead of
only two.

The resulting grammar is:

Meta-rules:

N ::→ n | N n

ABC ::→ a | b | c

Hyper-rules:

<s> :→ <N a> <N b> <N c>

<n N ABC> :→ <ABC terminal> <N ABC>

<n ABC> :→ <ABC terminal>


with derivations such as the following (each node's daughters are
indented beneath it):

<s>
    <nnna>
        <a terminal>
        <nna>
            <a terminal>
            <na>
                <a terminal>
    <nnnb>
        <b terminal>
        <nnb>
            <b terminal>
            <nb>
                <b terminal>
    <nnnc>
        <c terminal>
        <nnc>
            <c terminal>
            <nc>
                <c terminal>


While it is true that counting of this sort is not precisely a phenomenon
of natural language, this example demonstrates that other matters of
agreement which are not naturally handled by context-free rules may
nevertheless be handled simply by a van Wijngaarden grammar.
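
Because the only free choice in a derivation is the value given to N, a
membership test for this language is very short. The Python sketch below
is an illustration only; the function names generate and in_language are
hypothetical, and the grammar is summarized by its effect rather than
simulated rule by rule.

```python
def generate(n, letters="abc"):
    """String derived when N is replaced by n copies of 'n' in the first rule.

    Each non-terminal made of n 'n's and a letter casts off that letter's
    terminal exactly n times before the derivation ends.
    """
    return "".join(letter * n for letter in letters)


def in_language(text, letters="abc"):
    """Membership test: find the one value of N that could yield the text."""
    n, remainder = divmod(len(text), len(letters))
    return n >= 1 and remainder == 0 and generate(n, letters) == text


print(in_language("aabbcc"), in_language("aabbc"))
```

Passing letters="ab" gives the corresponding test for the two-letter
language of the preceding grammar.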


3.5 van Wijngaarden Grammars with Hyper-symbols as Predicates


We will now consider a device which can be seen as merely a matter
of style in writing van Wijngaarden grammars, but which opens up surprising
possibilities of gaining the effect of "tests" or conditions on properties
of symbols while remaining wholly within the original syntactic
framework. This possibility was not used in the original Algol 68 Report,
but was incorporated for the first time in the Revised Report; this
suggests, correctly, that the idea was not entirely obvious, even to
A. van Wijngaarden himself. But it is very simple, and although it is
in some sense a trick, it is a satisfyingly elegant trick. The basic
notion involved is to introduce extra hyper-symbols into hyper-rules,
and to arrange that the other hyper-rules should either derive EMPTY from
the additional symbols, or else should fail to derive any terminal
string, thus leading to a blocked derivation. But this explanation
gives no idea of how the idea is used, and we shall now develop that slowly.

Let us begin with an example of a van Wijngaarden grammar to
generate all strings of double letters, the (finite) language 'aa',
'bb', 'cc', ..., 'yy', 'zz'. A perfectly adequate van Wijngaarden grammar
would be:

Meta-rule:

ALPHA ::→ a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p |
          q | r | s | t | u | v | w | x | y | z

Hyper-rule:

<s> :→ <ALPHA terminal> <ALPHA terminal>


This grammar is correct because the Uniform Replacement Convention
assures that the hyper-rule represents exactly 26 production rules such as

<s> → <a terminal> <a terminal>

<s> → <b terminal> <b terminal>

and so on. Only strings where both letters are the same are generated
by the grammar, because only rules with both occurrences of ALPHA
replaced by the same letter are available. We will have derivations
such as

            <s>
         /       \
<b terminal>   <b terminal>


That grammar, as we said, is perfectly satisfactory, but now
consider this longer approach to defining the same language of double
letters:

Meta-rules:

m1. ALPHA ::→ a | b | c | ... | x | y | z

m2. ALPHA1 ::→ ALPHA

m3. ALPHA2 ::→ ALPHA

m4. EMPTY ::→ λ

Hyper-rules:

h1. <s> :→ <ALPHA1 terminal> <ALPHA2 terminal> <ALPHA1 ALPHA2>

h2. <ALPHA ALPHA> :→ <EMPTY>

This grammar relies crucially on the way the Uniform Replacement
Convention is understood to work. The URC says that in any one rule,
all occurrences of the same meta-symbol must be replaced by identical
terminal-strings generated in the meta-rules by the meta-symbol; but
different meta-symbols may be replaced with different terminal strings.

In particular, in hyper-rule 1 above ALPHA1 and ALPHA2 are different
meta-symbols; they may be replaced independently with the same values
or with different values, but repeated instances of ALPHA1 or repeated
instances of ALPHA2 must be replaced consistently with the same string
in every repeated instance. (In the meta-rules above, obviously ALPHA1
and ALPHA2 have been introduced, both directly deriving ALPHA, for
this very purpose, to have "synonyms" for ALPHA which are different to
the URC. In future grammars, we will assume without explicit mention
that all meta-symbols ending in single digits like these are introduced
as synonyms for meta-symbols without digits, as in meta-rules 2 and 3
above.)

By the URC, then, from the first hyper-rule we will get production-
rules like the following:

<s> → <a terminal> <a terminal> <a a>
<s> → <a terminal> <b terminal> <a b>
<s> → <a terminal> <c terminal> <a c>
<s> → <b terminal> <a terminal> <b a>
<s> → <b terminal> <b terminal> <b b>
<s> → <b terminal> <c terminal> <b c>
<s> → <c terminal> <a terminal> <c a>
<s> → <c terminal> <b terminal> <c b>
<s> → <c terminal> <c terminal> <c c>

The first column of terminals and the first letter in the pair at the
end both come from replacing ALPHA1; the second column of terminals
and the second letter in the pairs at the end both come from replacing
ALPHA2; so replacement values for the same meta-symbol always agree,
but replacement values for the different meta-symbols are chosen
independently.

We have dramatically increased the number of production-rules
used to define the language. In the previous grammar of double letters,
we had 26 production-rules in all; now in this grammar we have 26 times
26 production rules (676 rules) from the first hyper-rule alone, and all
the pairs of letters that we do not want are being generated so far.

We correct for this, and discard all the pairs of letters that
do not agree, with the second hyper-rule:

<ALPHA ALPHA> :→ <EMPTY>

Again by the URC the two instances of ALPHA must be replaced by the
same string of proto-symbols, so this rule underlies just an additional
26 production rules of the sort:

<a a> → <EMPTY>

<b b> → <EMPTY>

and so forth. Accordingly, the second hyper-rule will provide for
re-writing the pairs of letters generated by the first hyper-rule, just
in case they are the same letter. We will have derivations such as:

                   <s>
          /         |         \
<a terminal>   <a terminal>   <a a>
                                |
                             <EMPTY>

In just the cases we want, then, the strings in which the same terminal
is repeated, we will get a tree like this one (EMPTY produces the
terminal null string λ, but by our agreement we have instead the
meta-symbol EMPTY). Since the terminal string of the third non-terminal
<a a> is the null string, we have the correct string of terminals; and in
accordance with the convention explained earlier that all nodes in a tree
which dominate only EMPTY are pruned (that the tree is identified with
another tree which is the same except that those nodes are missing), we
will treat the structural description above as equivalent to the
structural description

            <s>
         /       \
<a terminal>   <a terminal>

and so both the string and the tree are what we want.

So much for how the strings of double letters are generated, which
was the language to be defined. Now in very many cases the partial
derivations produced by our grammar will not be like the ones we have
inspected. Instead, they will be like the following:

                   <s>
          /         |         \
<a terminal>   <b terminal>   <a b>

[Tree diagram: <s> dominates <a terminal>, <b terminal>, and <a b>;
nothing is dominated by <a b>.]

This is a very different situation. The production-symbol <a b> is
not a terminal, obviously, because it does not end with the proto-symbol
'terminal'. It is a non-terminal, but the grammar provides no way to
re-write that non-terminal symbol; there is no production-rule with the
production-symbol <a b> on the left side (because hyper-rule 2 only
provides production-rules for rewriting such symbols when they are
composed of the same character twice). So this last tree is not a
structural description underlying any string in the language of the
grammar. This attempted derivation just "blocks" since it cannot be
completed into a string of terminals.

Of the 676 production-rules from the first hyper-rule, 650 introduce
pair-symbols which are not matched; only 26 introduce double letters.
The 26 production-rules of the second hyper-rule then re-write the
double-letter symbols as EMPTY. The 650 rules always introduce
non-terminal symbols which cannot be rewritten, thus blocking a
derivation in a "blind alley." Thus, this second grammar also defines
just the language of the 26 pairs of double letters, the same language
as the first grammar did.
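
The counting argument above can be checked mechanically. The following
Python sketch is an illustration only: the function names (expand_h1,
expand_h2, language) and the encoding of a pair-symbol as a tuple are
assumptions of the sketch, not part of the formalism. It expands both
hyper-rules under the URC and keeps only the derivations that do not
block.

```python
from string import ascii_lowercase as ALPHABET


def expand_h1():
    """Pair-symbols introduced by the first hyper-rule.

    ALPHA1 and ALPHA2 are replaced independently, so there is one
    production-rule (and one pair-symbol) per ordered pair of letters.
    """
    return [(x, y) for x in ALPHABET for y in ALPHABET]


def expand_h2():
    """Left sides of the production-rules from <ALPHA ALPHA> : -> <EMPTY>.

    Both occurrences of ALPHA receive the same replacement (URC).
    """
    return {(x, x) for x in ALPHABET}


def language():
    """Terminal strings of all derivations that do not block."""
    rewritable = expand_h2()
    # A pair-symbol with no production-rule on its left is a blind alley.
    return [x + y for (x, y) in expand_h1() if (x, y) in rewritable]


if __name__ == "__main__":
    pairs = expand_h1()
    blocked = [p for p in pairs if p not in expand_h2()]
    print(len(pairs), len(expand_h2()), len(blocked))
    print(language())
```

Run as a script, this prints the three rule counts discussed in the text
followed by the 26 double-letter strings.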

Although in this example the second grammar is more cumbersome
than the first, the general strategy is useful in more complicated
grammars: it is often clearer to write a hyper-rule so that it gives
rise to unwanted production-rules, and then "restrict" those
production-rules by failing to provide additional production-rules to
rewrite all the symbols introduced.

The name "predicate" is used to describe hyper-symbols such as
<ALPHA1 ALPHA2> which are inserted only to either (a) yield EMPTY
and disappear, or (b) yield a "blind alley" and block. It is possible
to be more suggestive by adding some extra proto-symbols to a predicate
as window-dressing. For example, instead of <ALPHA1 ALPHA2>, we might
toss in two extra proto-symbols and make the hyper-symbol <where ALPHA1
is ALPHA2>. This would let us rewrite the hyper-rules of the last
grammar as

Hyper-rules:

h1. <s> : → <ALPHA1 terminal> <ALPHA2 terminal>
            <where ALPHA1 is ALPHA2>
h2. <where ALPHA is ALPHA> : → <EMPTY>

but  this  does  not  change  the  method  in  the  least,  it  only  gives  more 
suggestive  hyper-symbols  and  the  ability  to  define  more  than  one  predicate 
with  the  same  meta-symbols.  One  might  have 


<where ALPHA1 is ALPHA2>

<where ALPHA1 is not ALPHA2>

<where FEATURES1 are contained in FEATURES2>

<where FEATURES1 nonconflicts with FEATURES2>

and  so  on,  each  of  which  would  block  in  different  circumstances. 

It is required that all such predicate hyper-symbols be defined
in strictly syntactic terms, and we have not really shown yet how this
is done. (The example of <where ALPHA is ALPHA> is hardly representative,
because it uses the URC to immediately take over all the work.) How to
write the syntactic rules to make the predicates have their intended
effect is rather specialized, and for the present it is sufficient to
believe that a great deal is possible. As an aid in acquiring such a
belief we will go just one step farther here, and define a simple but
not trivial predicate hyper-symbol. Detailed examples are worked out
at length in sections 5.1 and 5.4 below.

For this example, we will define a van Wijngaarden grammar to
generate a language just the opposite of the last one: this time, we
will have the language of pairs of letters which are not the same.

(This is again a finite language: ab, ac, ad, ..., ay, az, ba, bc, ...)

We will use the same meta-rules again (this time omitting the rules for
ALPHA1 and ALPHA2 in accordance with our convention that they are
assumed to be synonyms for ALPHA), but we will need additional meta-rules
for STRINGs and new hyper-rules.

Meta-rules:

m1. ALPHA :: → a | b | c | ... | x | y | z

m2. EMPTY :: → λ

m3. STRING :: → ALPHA | STRING ALPHA
m4. OPTSTRING :: → STRING | EMPTY

(A STRING is just one or more ALPHAs, and an OPTSTRING is zero or more
ALPHAs.)

Hyper-rules:

h1. <s> : → <ALPHA1 terminal> <ALPHA2 terminal>
            <where ALPHA1 is not ALPHA2>
h2. <where ALPHA1 is not ALPHA2> : →
            <where ALPHA1 precedes ALPHA2 in abcdefghijklmnopqrstuvwxyz> |
            <where ALPHA2 precedes ALPHA1 in abcdefghijklmnopqrstuvwxyz>

h3. <where ALPHA1 precedes ALPHA2 in OPTSTRING1 ALPHA1
            OPTSTRING2 ALPHA2 OPTSTRING3> : → <EMPTY>

The length of the hyper-symbols in the hyper-rules makes them fit the
lines badly, but careful attention will sort them out. Rule h1 rewrites
<s> as a terminal, followed by another terminal, followed by a predicate.
Rule h2 rewrites a predicate as either of two alternative more-primitive
predicates. Rule h3 rewrites a single monstrously-long left-side symbol
as EMPTY; note in this last rule h3 that everything but EMPTY is in a
single pair of angle-brackets and is on the left side of the rule.

Clearly the basic idea of this grammar is the same as the last one,
because in the first hyper-rule the only element which is changed is
the final predicate; as before we generate two terminals and a restriction
on them, but now the restriction is the reverse of what it was.

Rule h2 defines what it means for one ALPHA to "be not" another
ALPHA, by saying that it means either that the first precedes the second
in the alphabet or else the second precedes the first in the alphabet.

(If neither of these is true, then they are the same letter.)

Rule h3 then defines what it means for one character to precede
another in a string, by saying that if character 1 precedes character 2
in the string, then it will be possible to divide the string into five
parts so that the first part is zero or more characters, the second part
is character 1, the third part is zero or more characters, the fourth
part is character 2, and the fifth part is zero or more characters.

But the summary just given of the hyper-rules is not quite adequate,
because the effect described is achieved not by some kind of process of
trying a string and seeing if it can be divided, and so on, but purely in
terms of symbols and rewrite rules. A sample derivation in this grammar
could be


[Tree diagram: <s> dominates <c terminal>, <f terminal>, and
<where c is not f>; the last dominates
<where c precedes f in abcdefghijklmnopqrstuvwxyz>, which dominates
<EMPTY>.]

We know that there is a production-rule which produces EMPTY from that
last symbol, because it comes from the last hyper-rule:

<where ALPHA1 precedes ALPHA2 in OPTSTRING1 ALPHA1 OPTSTRING2 ALPHA2 OPTSTRING3>

<where c      precedes f      in ab         c      de         f      ghijkl... >

ALPHA1 and ALPHA2 are replaced by the same values at their repeated
appearances, and OPTSTRING1, OPTSTRING2, and OPTSTRING3 are independently
chosen as sequences of zero or more characters; therefore, this must be
a production-rule validly produced from hyper-rule 3. Prolonged inspection
will convince the interested reader that when the two characters are the
same (ALPHA1 and ALPHA2 are the same terminal proto-symbols) and the string
introduced by hyper-rule 2 is the alphabet, there is no way that any values
can be chosen to make a production-symbol which is also on the left of
any production-rule generated by hyper-rule 3, and hence when the two
characters are the same the derivation will lead to a blind alley.
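
The claim that a suitable production-rule exists for c and f, and that
none exists when the two characters are the same, can be illustrated
with a short Python sketch. The names decompose, is_not and language are
hypothetical. Note that the search loop is only a way of asking whether
hyper-rule 3 yields a matching production-rule; the grammar itself
performs no search, it simply has such a rule or lacks it.

```python
from string import ascii_lowercase as ALPHABET


def decompose(a, b, string):
    """Find values for OPTSTRING1, OPTSTRING2, OPTSTRING3 such that
    string == OPTSTRING1 + a + OPTSTRING2 + b + OPTSTRING3.

    Returns the five parts as a tuple, or None if hyper-rule 3 yields no
    production-rule for <where a precedes b in string>.
    """
    for i in range(len(string)):
        if string[i] != a:
            continue
        for j in range(i + 1, len(string)):
            if string[j] == b:
                return (string[:i], a, string[i + 1:j], b, string[j + 1:])
    return None


def is_not(a, b):
    """Rule h2: one of the two alternatives must be rewritable as EMPTY."""
    return (decompose(a, b, ALPHABET) is not None
            or decompose(b, a, ALPHABET) is not None)


def language():
    """Pairs of letters whose derivation does not block."""
    return [x + y for x in ALPHABET for y in ALPHABET if is_not(x, y)]


if __name__ == "__main__":
    print(decompose("c", "f", ALPHABET))
    print(decompose("c", "c", ALPHABET))
    print(len(language()))
```

For c and f the first call returns the same five-part division shown in
the display above; for c and c it returns None, which corresponds to the
blind alley.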

Now, once two letters can be distinguished, then two strings of
letters can be distinguished (one letter at a time), and so any information
which can be coded into strings (that is, any information) can be
manipulated. The details of this manipulation are fascinating, and to an
enthusiast may be the most interesting facet of van Wijngaarden grammars;
it is satisfying to be able to define the predicates with no additional
machinery whatsoever. (van Wijngaarden 1974 contains a two-level grammar
to simulate a Turing machine, using 9 meta-rules and 30 hyper-rules.)

As a practical matter, though, the exact definition of the predicates
is of secondary importance. Once basic predicates are defined, then
they can be used in one grammar after another without changes (something
like a subroutine library). Also, the predicates would never be used in
a computer program as they are in the grammars because they would be
implemented instead with primitive tests for identity, non-identity,
etc., outside the grammatical apparatus. Moreover, since anything can
be coded as a predicate hyper-symbol, there is no actual restriction on
expressive power in requiring that the definition be syntactic. It is
important, however, that the URC restricts the predicate to appearing
in the very hyper-rule where the items to be restricted are introduced;
we gain 'locality of definition' when the restriction is on nodes
introduced by a single rule, rather than on global tree configurations.
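
As a rough illustration of the practical point, the following Python
sketch replaces the syntactic definition of two predicates with
primitive tests kept in a small table, in the manner of a subroutine
library. The table PREDICATES and the function generate are hypothetical
names chosen for the sketch; nothing here is taken from an actual
implementation.

```python
from itertools import product
from string import ascii_lowercase as ALPHABET

# Hypothetical predicate "library": each predicate hyper-symbol is
# implemented as a primitive test instead of by further hyper-rules.
PREDICATES = {
    "is": lambda x, y: x == y,        # <where ALPHA1 is ALPHA2>
    "is not": lambda x, y: x != y,    # <where ALPHA1 is not ALPHA2>
}


def generate(predicate_name):
    """Strings generated by
    <s> : -> <ALPHA1 terminal> <ALPHA2 terminal> <where ALPHA1 ... ALPHA2>
    when the predicate is evaluated directly instead of being derived."""
    test = PREDICATES[predicate_name]
    return [x + y for x, y in product(ALPHABET, repeat=2) if test(x, y)]


if __name__ == "__main__":
    print(len(generate("is")), len(generate("is not")))
```

The same table can be reused by any grammar that mentions these
predicates, which is the sense in which the text compares them to a
subroutine library.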

4. Prior Linguistic Uses of Grammars with Structured Vocabulary

It is immediately suggestive that there have been several links
between the development of the van Wijngaarden grammar formalism which
we have been considering and grammars of natural languages. Adriaan van
Wijngaarden himself, at the Mathematisch Centrum, applied its earliest
computing machinery to a study of newspaper Dutch, saying that "we hope
that this and similar information to be obtained in the future will
help us to get better insight in the formal properties of our language"
(van Berckel et al. 1965), and since then has shown an interest in natural
language analysis using Algol 68 (van Wijngaarden 1970, Smith 1976).

C. H. A. Koster, an author of the Algol 68 Report and an influential
advocate of the van Wijngaarden descriptive method, had first used a
similar formalism in writing a context-free grammar of English in 1962
(see Koster 1965). The author of a technical note on van Wijngaarden
grammars (Deussen 1975) makes reference to a doctoral dissertation of
the same period which consists of a large surface grammar of German
utilizing essentially the van Wijngaarden grammar format (Schneider 1965,
1966). And one of the very earliest formalizations of the van Wijngaarden
method was undertaken at the Université de Montréal for use in a
French-English machine translation project (de Chastellier and
Colmerauer 1969).

It  seems  likely  that  this  repeated  invention  has  been  prompted  by 
necessity  when  context-free  grammars  are  used  to  describe  natural  languages. 
Evidence  that  this  is  so  is  forthcoming  when  one  begins  to  consider  other 
previous  linguistic  work  in  light  of  the  model  of  grammars  with  structured 
vocabularies  provided  by  van  Wijngaarden  grammars,  and  finds  repeatedly 
the  same  essential  use  of  structured  vocabulary  as  a natural  part  of  the 
descriptive  techniques  used. 

Since it is not at once obvious exactly how to go about making
best use of structured vocabulary in describing natural languages, it
is reasonable to examine the independent approaches toward structured
vocabulary which linguists have needed. The uses exemplified may be
instructive, even when the understanding is as frustrated but as
tantalizingly close as in this passage where Hockett (1966) describes a
"componential alphabet" on two levels:

A simple example is Potawatomi noun stems (N) which are either
animate (An) or inanimate (In) in gender (Gn), and also either
independent (Ind) or dependent (Dep) in dependency (Dp). ...

One could provide for the whole situation in a single composite
rule subsuming four elementary rules:

N → NAnInd, NAnDep, NInInd, NInDep

(Footnote 63:) I am not sure how the procedure developed here
would fit into the rest of the grammar. I am not sure whether
my notations such as 'NAnInd' are single characters or strings
of characters; perhaps indeed one here needs to use a componential
alphabet so that NAnInd (and the like) can be a single character
but with components susceptible to separate manipulation.

Hockett's puzzlement in his footnote appears to have been shared by many
others who felt that phrase-structure grammars missed some essential
features of natural languages, and surprisingly often it turns out to
be possible to interpret the missing element as provision for structured
vocabulary.

4.1 Pre-Chomskian Uses of Structured Vocabulary

Although there is little doubt that diligence could locate the
true source of two-level grammars in Panini, we will for brevity in
this initial essay begin with the American linguists who immediately
preceded Chomsky. Their usage of structured vocabularies in describing
natural languages may be most conveniently studied in Chomsky's own
early manuscript The Logical Structure of Linguistic Theory (1955/1975)
and in Paul Postal's provocative demonstration that a whole range of
descriptive methods amount to very little more than phrase-structure
grammars (Postal 1964b). These summaries are interesting enough, however,
to suggest that additional first-hand consideration could be rewarding.

A lengthy  development  related  to  the  ideas  of  Zellig  Harris  is 
provided  by  Chomsky  (1955/1975),  which  seems  to  be  of  chief  importance 
in  understanding  the  ideas  about  structured  vocabulary  which  played  a 
role  in  Chomsky's  own  early  theories.  There  the  motivation  for  structured 
vocabulary  arises  from  an  elaborate  scheme  to  establish  "grammatical 
categories"  so  as  to  explain  "degrees  of  grammaticalness"  (i.e.,  why 
"of  of  the  of"  is  less  grammatical  than  "colorless  green  ideas  sleep 
furiously"  — at  this  period,  however,  the  second  of  these  strings  was 
considered  to  be  grammatical,  a status  which  it  was  later  to  lose). 

Chomsky describes a process of hierarchical sorting, or clustering,
which establishes many extremely tiny grammatical categories containing
only one or a few lexical items as members, then groups some of these
into larger categories, some of the resulting categories into still
larger categories, etc. Finally, there is an evaluation procedure (whose
details are unknown, as is usual with evaluation procedures) by which
to choose the level of grouping which is optimal for describing the
language, and this level so chosen is called the "absolute analysis."
Chomsky explains:

The  absolute  analysis  embodies  the  major  grammatical 
restrictions.  Presumably  these  will  be  stated  in  terms  of  such 
classes  as  Noun,  Verb,  Preposition,  etc.  There  will  then  be 
many  further  grammatical  restrictions  that  have  to  do  with  limited 
and  special  contexts,  and  that  will,  presumably,  be  reflected  in 
superior  degrees  of  grammaticalness  (i.e.,  smaller,  lower-order 
categories). These further restrictions correspond in part to
what Harris has called selection. Thus selectional restrictions
can be defined as those which refer to an account of grammaticalness
which is more detailed and specific than that provided by the
absolute analysis. Although Preposition may well turn out to be
a class of the absolute analysis for English, there will be
sub-classes of prepositions that occur with different nouns and
verbs, etc....

The difficulty is that categories of different sizes on different levels
may simultaneously make different linguistically-significant
generalisations:

There will, for instance, be various strings which we would like
to say are noun phrases, even though they do not all appear
grammatically (with first-order grammaticalness) as subjects of the
same verb phrases, although each occurs grammatically with some
verb phrase.

It  is  the  desire  to  keep  multiple  levels  of  generalisation  which  leads 
eventually  to  treating  these  category  symbols  as  complex  symbols  (as  in 
section  3.3  above),  so  that  the  complex  similarities  and  differences 
can  be  preserved. 

Some  understanding  akin  to  the  one  just  outlined,  that  various 
grammatical  categories  were  alike  for  some  purposes  but  different  for 
others,  seems  to  have  been  general  among  American  linguists  of  the  period. 
This  understanding  was  of  principal  use  in  two  situations:  (1)  describing 
the  occurrence  of  items  whose  environments  were  nearly  identical,  and 
(2) describing agreement phenomena, or perhaps more generally
"discontinuous constituents". Although it is sometimes difficult to
sharply separate these two, they do seem to be different uses.

A convenient example of the first kind of use is provided by Zellig
Harris. Harris's formulas introduce several kinds of description
reminiscent of two-level description. In one of these, symbols are given
numbers and "each higher numbered symbol represents all lower numbered
symbols, but not vice-versa" (Harris 1951). For instance, the formula
N^2 -s = N^3 also represents the additional formula N^1 -s = N^3;
but such generalization only occurs on the left sides of formulas (the
right sides of rules). This system would be modeled by a van Wijngaarden
grammar which made use of meta-symbols on only one side of hyper-rules.
E.g.,

NTHREE :: → nthree | NTWO
NTWO :: → ntwo | NONE
NONE :: → none

<nthree> : → <NTWO> <s>

The numbers are introduced by Harris because an N^3 can occur
everywhere an N^2 can occur, except in the rules turning an N^2 plus
something else into an N^3; thus, by collapsing all these rules for items
in the same environments, much duplication of rules is avoided.
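
To make the collapsing concrete, the Python sketch below expands the
single hyper-rule above into the production-rules it stands for. The
dictionary META transcribes the three meta-rules; the function names
values and expand are assumptions of the sketch, and each hyper-symbol
is simplified to a single token.

```python
from itertools import product

# The three meta-rules; lower-case items are proto-symbols.
META = {
    "NTHREE": ["nthree", "NTWO"],
    "NTWO": ["ntwo", "NONE"],
    "NONE": ["none"],
}


def values(symbol):
    """All proto-symbols derivable from a symbol via the meta-rules."""
    if symbol not in META:
        return [symbol]          # already a proto-symbol
    result = []
    for alternative in META[symbol]:
        result.extend(values(alternative))
    return result


def expand(left, right):
    """Production-rules underlying a hyper-rule whose hyper-symbols are
    single tokens and whose meta-symbols each occur only once."""
    choices = [values(token) for token in right]
    return [(left, list(combo)) for combo in product(*choices)]


if __name__ == "__main__":
    for rule in expand("nthree", ["NTWO", "s"]):
        print(rule)
```

The one hyper-rule yields two production-rules, one with ntwo and one
with none on the right side, which is the saving Harris obtains from his
numbered symbols.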

A related example is given by Harris to show the utility of grouping
morpheme classes into classes of "positions" in which morphemes occur
(Harris 1951). If morphemes of class Q occur in positions which are the
same as those of class R morphemes (the two being differentiated only
in which adjuncts such as '-ly', '-al' they occur before), then it is
possible to make a "positional category" N which includes Q and R (which
are now to be written N_a and N_b respectively, to show that they are
members of class N). The adjuncts -ly, -al are similarly classified
into Na_a and Na_b. "...we have the equations N_a + Na_a = A,
N_b + Na_b = A, etc., all of which can be summarized in the
position-class equation N + Na = A. It is understood that this equation,
unlike our previous ones, holds not for every member of the classes
involved but only for certain members (or sub-classes)." The ones for
which it holds, of course, are the corresponding ones which appeared
together among the sub-class equations, and which have been here
suppressed.

The  similarity  of  this  understanding  to  the  Uniform  Replacement 
Convention  suggests  that  a similar  van  Wijngaarden  grammar  could  be 
written  along  the  lines  of 

CLASS :: → a | b

<a> : → <n sub CLASS> <na sub CLASS>

which serves as the "position-class equation" N + Na = A. The sub-class
equations summarized by this one can then be recorded in further
hyper-rules:

<n sub a> : → <q>

<na sub a> : → <-ly terminal>

<n sub b> : → <r>

<na sub b> : → <-al terminal>

As Harris remarks, "It is impossible to eliminate from our records the
explicit sub-class equations"; their role is here assumed by these further
hyper-rules which preserve what each sub-class of N and Na represents.

The  "understanding"  or  convention  of  Harris's  that  these  position-class 
equations  are  to  be  understood  specially  as  holding  only  of  corresponding 
sub-classes  is  not  formally  represented  in  his  notation;  in  the  van 
Wijngaarden  grammar,  of  course,  this  is  reconstructed  by  the  meta-symbols 
in  the  hyper-rules  and  by  the  URC  governing  their  replacement. 
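
A small Python sketch may help show how the URC reconstructs Harris's
unstated convention. The names CLASS_VALUES, SUBCLASS,
expand_position_class and combinations are hypothetical; the data are
transcribed from the rules given above, with the terminals written as
bare strings.

```python
# Values of the meta-symbol CLASS.
CLASS_VALUES = ["a", "b"]

# The further hyper-rules recording the sub-class equations.
SUBCLASS = {
    "n sub a": "q",
    "na sub a": "-ly",
    "n sub b": "r",
    "na sub b": "-al",
}


def expand_position_class():
    """Production-rules underlying <a> : -> <n sub CLASS> <na sub CLASS>.

    CLASS is replaced by the same value at both occurrences (URC), so
    only corresponding sub-classes are ever paired.
    """
    return [("a", ["n sub " + v, "na sub " + v]) for v in CLASS_VALUES]


def combinations():
    """Stem-class and adjunct pairs licensed by the grammar."""
    return [tuple(SUBCLASS[symbol] for symbol in right)
            for _, right in expand_position_class()]


if __name__ == "__main__":
    print(combinations())
```

Only Q with -ly and R with -al are produced; the crossed pairings never
arise because no production-rule mixes two different values of CLASS.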

The motivation here again seems to have been a desire to collapse
rules dealing with the identical or nearly-identical environments of the
"Q" and "R" morphemes, but in this case there is also a need to define
slots and to require that they be filled subject to agreement; so this
usage shades over into our second type.

The second motivation for use of structured vocabulary in earlier
linguistic work appears to be the desire to represent phenomena of
"agreement" or "concord."

This was sometimes seen as of a piece with the problem of
"discontinuous constituents." Harris, again (1951), employed what he
called "long components" (by analogy to supra-segmental phenomena in
phonology) to express agreement: "Similarly, ...a...a is a single
morphemic segment, meaning female" in Latin filia bona 'good daughter';
such a female "long component" is treated as a component part of several
complex symbols.

This process is further extended and is really rather sophisticated, but
it is never at all comfortable within an immediate constituent analysis.

Quite a number of people appear to have thought of utilizing
"variables" in their formulas or representations, with something like
the Uniform Replacement Convention to govern them and thus to enforce
agreement. Chomsky, in § 54 of (Chomsky 1955/1975), considers this very
interpretation of "long components" and proposes a notation for rules:


P_i → P_j^k P_l^k

k → a

k → b


Here the "long component" is the symbol "k". Obviously the first rule is
something like a van Wijngaarden hyper-rule, and the last two rules are
something like meta-rules (and so we see that there is no separation of
levels in this grammar). Observe that, just as in the Harris superscript
numbers, the variables occur only on the right side of the rule.

Chomsky then goes on to propose something very like the URC to
govern the interpretation of these rules, adding "suppose further that
by convention all identical superscripts assume the same value in
derivations .... The derivations would now work out exactly right, the
algebra would be restricted, and the notations NP, VP, etc. would be
retained with all essential generality." What this means is that because
of the "componential" nature of category symbols like P_j^k, it is
possible to let the P be "NP" and to recognise it as the same symbol
regardless of what value of k qualifies it. Chomsky's rules and
convention are clearly intended to be interpreted to be virtually
identical to the van Wijngaarden rules

K :: → a | b

<pi> : → <pj K> <pl K>


Chomsky decides, on the basis of having tried such a system to
represent long components in his Morphophonemics of Modern Hebrew (1951),
that it could well be useful for such things as number agreement, but
it is not an appropriate device for imposing the vast complex of
restrictions necessary to avoid "the rearming of Germany is at dinner";
that is, apparently, for telling grammatical sense from grammatical
nonsense. Chomsky then concludes with a note of much interest to those
studying van Wijngaarden grammars:

This is an important question, deserving of a much fuller
treatment, but it will quickly lead into areas where the
present formal apparatus may be inadequate. The difficult
question of discontinuity is one such problem. Discontinuities
are handled in the present treatment by construction
of permutational mappings from P to W (phrase structures
to words), but it may turn out that they must ultimately
be incorporated into P (phrase structure) itself.

Chomsky himself appears to have believed that transformations
were another tool for dealing with agreement, and so naturally he was
more interested in following up the transformational approach to the
question. Postal (1964b) extols the incomparable value of the
transformation to achieve agreement, and Koutsoudas (1966) exhibits many
examples of the technique of generating an item of agreement in one
constituent and then transformationally copying it into other
constituents with which agreement was required. Postal (1964b) also
credits Sydney Lamb (1962) in "stratificational grammar" and Elson and
Pickett (1960) in "tagmemics" with introducing the ideas of
variable-symbols and their uniform replacement to describe agreement
phenomena; further study would probably uncover several more similar
developments.

Our  preliminary  scan,  then,  indicates  early  use  of  structured 
vocabulary for two major purposes: (1) to indicate that two category
symbols  share  many  properties  but  are  not  subject  to  expansion  by  all 
the  same  rules  — that  is,  to  encode  "rule  features";  (2)  to  indicate 
that  two  category  symbols  share  many  properties,  but  differ  in  their 
"contextual  restrictions"  of  the  sort  usually  thought  of  as  "agreement". 
Either  use  can  be  naturally  incorporated  into  a van  Wijngaarden  grammar. 

4.2  Structured  Vocabulary  in  Transformations  and  'Extended  Phrase 
Structure  Grammars' 

Chomsky,  throughout  the  period  of  Syntactic  Structures  (1957)  and 
"A  Transformational  Approach  to  Syntax"  (1962)  continues  to  employ 
symbols  which  have  the  informal  appearance  of  being  structured,  but  he 
does  not  provide  any  formal  method  of  representing  their  relatedness. 
Thus,  he  gives  rules  such  as 


NP_sing → T + N + Ø

NP_pl → T + N + S

[The remainder of this rule display is only partly legible. It shows
alternatives conditioned "in env. NP_sing + Aux" and "in env. NP_pl +
Aux", the alternatives "be" and "become", an optional "(Comp)", and
verb sub-classes V_Ta, V_Tb, ..., V_Tg "in env. N_h" beside V_Tx.]

Here in the first rules the symbols NP_sing and NP_pl are being
used for number agreement; but the two symbols, no matter how suggestively
similar, are different and unrelated symbols in the grammar. It makes
no difference that the brackets are drawn around alternatives in the
first rule, or drawn around two entire rules; the symbols NP_sing and
NP_pl remain as distinct as Aux and Prt in the vocabulary of the grammar.
The rules dealing with verbs use context-sensitive format to impose
rule-features as well as agreement.

The reason why Chomsky gets along reasonably well without a method
for relating various symbols in the vocabulary is that the only
"relatedness" of importance, in much of this work, is that a group of
symbols should be treated the same way by transformations, and this kind
of relatedness is reconstructed in the definition of a transformation
rather than in the vocabulary of node labels.

A transformation operates on terminals, but segments them into
terms based on the complicated notion of a "proper analysis" which amounts
to a string of non-overlapping sub-trees which together exhaustively
dominate the terminal string. This means, among other things, that if
the structural description of the passive transformation is

NP - Aux - VPP - NP
1    2     3     4

then its first term is as well satisfied by a subtree

   NP
   |
NP_sing
  / \
 T   N

as it is satisfied by the alternative subtree

  NP
  |
NP_pl
 /|\
T N S

because only the top node of the sub-tree is relevant to satisfying the
structural description of the transformation.

This means that there is one way NP_pl and NP_sing, for example,
are related in the grammar, and that is that they can both be rewritings
of NP. This is utilised by the transformational rules, since in the
structural description of a transformational rule a node label is like
a variable which stands for any string of terminals which can be derived
from that node in the base grammar. (Distinguish this from the usual
meaning of "variable" in transformational specifications, which is any
unspecified left-to-right factorization.)

This  interplay  between  the  structure  of  the  base  grammar  and  the 
structural  description  specifications  of  transformational  rules,  then, 
acts  in  some  ways  like  a two-level  grammar.  To  see  how  this  is  so,  we 
can model this particular aspect of a transformational grammar as a van
Wijngaarden grammar. (This is purely to guide intuition about the
similarity; there is no suggestion that a van Wijngaarden grammar can be made
to  act  like  a transformational  grammar  in  any  satisfactory  way.  But  on 
this  one  point,  the  similarity  is  striking.)  We  imagine  that  the  base 
grammar  of  a transformational  grammar  is  the  meta-grammar,  and  the 
transformations  are  hyper-rules;  modeling  the  rules  on  those  presented 
at  the  first  of  this  section,  we  could  have: 

NP      ::→  NPSING | NPPL
NPSING  ::→  t n
NPPL    ::→  t n -s
VP      ::→  AUX VPP
VPP     ::→  VTCM comp | VTPR prt
VTCM    ::→  VTA | VTB | ... | VTG
VTPR    ::→  VTX

and so forth; the entire base component is simply a context-free
meta-grammar. Then the passive transformation is a hyper-rule:

<NP1 AUX VPP NP2> : <NP2> <AUX> <be-en> <VPP> <by> <NP1>

(Notice that there is only one hyper-symbol on the left of this rule,
a string of meta-symbols, but there are six hyper-symbols on the right
side.)

This  hyper-rule  simply  and  correctly  represents  all  the  (many, 
many,  many)  rules  in  which  a terminal  string  derived  (in  the  meta-grammar) 
from  the  meta-symbols  in  the  left-hand  hyper-symbol  is  re-written  as  its 
correct permutation under the passive transformation. The Uniform Replacement
Convention will assure that the subject and object NP's (1 and 2)
are  reversed,  and  that  they  are  exactly  the  same  terminal  strings  in  the 
active  and  passive  versions. 

Thus, again: the reason that NP_sing and NP_pl are treated alike by
transformations is exactly the same reason that NPSING and NPPL are
treated alike in the "passive hyper-rule"; the permutation rule is written
in terms of the category NP, and both symbols can be derived from NP in
the base grammar (meta-grammar).
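
The way one hyper-rule stands for many ordinary rules can be made concrete with a small sketch. The Python below is an illustration only, not part of any system discussed here: the names META, derive and expand are hypothetical, the meta-grammar is cut down to a toy fragment, and the terminals "aux", "v" and "comp" are placeholders. It enumerates the rules abbreviated by the passive hyper-rule, binding each variable to one terminal string throughout a rule (the Uniform Replacement Convention).

```python
from itertools import product

# Toy meta-grammar (assumed fragment): each meta-symbol maps to its
# alternatives. Anything that is not a key is treated as a terminal.
META = {
    "NP": [["NPSING"], ["NPPL"]],
    "NPSING": [["t", "n"]],
    "NPPL": [["t", "n", "-s"]],
    "AUX": [["aux"]],
    "VPP": [["v", "comp"]],
}

def derive(symbol):
    """Return all terminal strings (tuples of tokens) derivable from symbol."""
    if symbol not in META:
        return [(symbol,)]
    strings = []
    for alternative in META[symbol]:
        for parts in product(*(derive(s) for s in alternative)):
            strings.append(tuple(tok for part in parts for tok in part))
    return strings

# Passive hyper-rule; the digit suffix distinguishes the two NP variables.
LEFT = ["NP1", "AUX", "VPP", "NP2"]
RIGHT = ["NP2", "AUX", "be-en", "VPP", "by", "NP1"]

def expand(left, right):
    """Enumerate the ordinary rules abbreviated by one hyper-rule."""
    variables = sorted(set(left))
    choices = [derive(v.rstrip("0123456789")) for v in variables]
    rules = []
    for combo in product(*choices):
        # Uniform replacement: one binding per variable for the whole rule.
        binding = dict(zip(variables, combo))
        lhs = tuple(tok for v in left for tok in binding[v])
        # Symbols that are not variables (be-en, by) are copied literally.
        rhs = tuple(tok for v in right for tok in binding.get(v, (v,)))
        rules.append((lhs, rhs))
    return rules

RULES = expand(LEFT, RIGHT)
```

With two alternatives for each NP and one each for AUX and VPP, the toy fragment yields four ordinary rules; a realistic base grammar would yield the "many, many, many" rules mentioned above.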

(To  bring  up  this  comparison  suggests  the  question  of  whether  van 
Wijngaarden  grammars  can,  or  should,  be  used  in  this  way  to  replace 
transformational  grammars.  The  answer  appears  to  be  no,  to  both,  although 
an  attempt  has  been  made  to  use  van  Wijngaarden  grammars  in  just  this 
way to describe natural languages (de Chastellier and Colmerauer 1969),
from which attempt a still more general formalism for manipulating
tree-structures was later developed (Colmerauer's Q-system). In any case some
additional conventions are necessary to understand van Wijngaarden grammars
as tree-manipulating systems, since ordinarily the trees of the
meta-grammar have no interpretation. This leads to different usages, which
we  will  not  consider  here.) 

An important point, however, is that transformational rules gain
some  of  their  naturalness  from  their  ability  to  refer  to  many  different 
strings  under  a single  variable  cover-term,  and  their  ability  to  specify 
the  strings  corresponding  to  the  variable  by  a phrase-structure  grammar 
(the  base  grammar  of  the  transformations);  this  is  so  basic  to  the  system 
that it is usually thought of as inherent, and is the property called
"structure-dependence"  in  transformational  theory. 

At  the  same  time  that  Chomsky  was  making  use  of  this  special  kind 
of  "two-level"  property  in  his  transformational  grammars,  there  was  a 
quite different proposal for two-level grammars which met a bizarre fate;
this  was  the  "extended  phrase  structure  grammar"  of  Gilbert  Harman 
(Harman  1963). 

The  Harman  proposal  was  not  really  novel,  being  pretty  much  a 
proposal to write rules in the programming language COMIT (Yngve 1960,
1961, 1972); Chomsky will have it that the notation was really devised
by  G.  H.  Matthews  while  writing  a grammar  of  German  in  1957-58,  and  that 
Matthews  really  developed  the  COMIT  system  as  well  (Chomsky  1965,  pp.  79 
and  213).  Harman's  paper  has  become  well  known  only  because  of  some 
unwontedly  colorful  remarks  about  ants  and  antelopes  and  about  extended 
baboons  in  which  Chomsky  attacked  it  (Chomsky  1966);  his  criticism 
must have been chiefly motivated by Harman's provocative stance in
maintaining that transformations had been proved to be unnecessary, because
the  substance  of  Harman's  paper  should  not  have  been  offensive. 

Basically,  the  Harman  rules  envisioned  a structured  vocabulary 
consisting  of  a set  of  category  symbols,  each  augmented  with  a set  of 
"subscripts"  or  "tags".  An  ordinary  context-free  rule  such  as 

A → B C

is  then  understood  to  mean  that  A is  rewritten  as  B followed  by  C,  and 
that  all  of  A's  tags  are  "copied"  onto  B and  onto  C;  it  is  just  like 
the  static  notion  of  a schema 

<A TAGS> → <B TAGS> <C TAGS>

The  definition  of  the  tags  is  not  clearly  separated,  however,  and  thus 
there  are  many  additional  conventions  for  adding  to,  modifying,  and 
erasing the tag set associated with a particular symbol. Rule features
are implemented by conditions on the tags associated with the
left-side symbol.
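
The tag-copying convention can be sketched in a few lines of Python. This is only an illustration of the idea as described above, not Harman's or COMIT's actual notation: the names Symbol, Rule and apply_rule are hypothetical, and the "requires" and "erases" fields stand in, very roughly, for the additional conventions on conditions and tag modification.

```python
from typing import FrozenSet, List, NamedTuple, Optional, Tuple

class Symbol(NamedTuple):
    category: str
    tags: FrozenSet[str]

class Rule(NamedTuple):
    lhs: str
    rhs: Tuple[str, ...]
    requires: FrozenSet[str] = frozenset()  # tags the left side must carry
    erases: FrozenSet[str] = frozenset()    # tags dropped before copying

def apply_rule(symbol: Symbol, rule: Rule) -> Optional[List[Symbol]]:
    """Rewrite symbol by rule, copying the surviving tags onto every child."""
    if symbol.category != rule.lhs or not rule.requires <= symbol.tags:
        return None  # wrong category, or the rule-feature condition fails
    inherited = symbol.tags - rule.erases
    return [Symbol(child, inherited) for child in rule.rhs]
```

Applying the rule S → NP VP to an S tagged "sing" gives an NP and a VP that both carry "sing", which is how a single written rule enforces agreement across what would otherwise be separate singular and plural rules.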

Harman's  "defense  of  phrase-structure"  (the  subtitle  of  his  paper) 
is  that  Chomsky  has  not  properly  represented  the  tradition  of  immediate- 
constituent  analysis  in  defining  context-free  grammars;  among  other  things, 
Chomsky  has  not  represented  discontinuous  elements  or  agreement.  (And 
this was, as we have remarked above, a legitimate complaint.) If Chomskian
phrase-structure grammars are augmented to restore these traditional
parts of immediate-constituent analysis, Harman believes that such
"extended phrase-structure grammars" can then adequately describe natural
languages; unfortunately, however, the resulting grammars will be very
large, too large to handle in practice. Harman replies to this admitted
difficulty in using extended phrase-structure grammars by saying that

the proper answer to this practical problem is that it
is only a technical difficulty. Being only technically
a difficulty, it can be overcome by changing techniques.
We require some technique which will enable us to write
and grasp millions of rules at once; that is, we require
a useful way of abbreviating large sets of rules. (

The  desire  to  "write  and  grasp  millions  of  rules  at  once"  is  probably 
the best description given so far of the motivation leading to
two-level van Wijngaarden grammars and similar grammars with structured
vocabulary. 

But Harman's purpose of abstractly abbreviating large sets of
rules gets lost in Chomsky's rebuttal, covered over by quarrels about
whether the result is still properly called a phrase-structure grammar.
Just how lost it was can be seen from McCawley's review of Chomsky's
argument (McCawley 1968b), in which McCawley says that if Chomsky had
penetrated behind the terminological question

He  could  have  made  a far  more  devastating  criticism  of  it 
than  he  presented.  Harman  is  able  to  dispense  with  agreement 
transformations  only  at  the  cost  of  having  separate  rules, 
e.g.  to  select  the  number  of  the  subject  NP  (which  must  be 
attached  as  a feature  of  the  S-node  which  dominates  it,  so 
as  to  allow  that  feature  to  be  'inherited'  by  the  VP  through 
Harman's  'feature  inheritance'  mechanisms),  and  to  select  the 
number  of  all  other  NP's.  (Footnote:  Because  Harman 
neglected  to  include  this  latter  type  of  rule  in  his  restatement 
of  the  rules  of  Chomsky  (1957),  his  rules  generate  only 
sentences  in  which  all  NP's  have  the  same  number.)  Since  the 
inheritance  of  features  from  a common  dominating  node  in  the 
surface  structure  is  Harman's  only  means  of  incorporating 
selectional  restrictions  into  a grammar,  and  since  there  are 
infinitely  many  types  of  verb-NP  selectional  restrictions 
which  can  hold  in  surface  structure,  Harman's  treatment 
would  require  an  infinite  number  of  selectional  features 
and  infinitely  many  rules  to  attach  them  to  the  relevant 
S-nodes. 

McCawley's  account  of  Harman's  method  is  quite  accurate,  particularly 
in describing the feature-inheritance mechanism which it shares with
van Wijngaarden grammars; but McCawley believes that this accurate
account is a devastating criticism because of its obvious absurdity.
McCawley  seems  not  to  have  appreciated  the  point  that  it  is  not  absurd 
to  define  a language  by  millions  of  rules  (or  by  an  infinite  number  of 
rules),  so  long  as  you  have  a technique  for  defining  those  rules  which 
permits you to "write and grasp" them.

There  are  two  ways  in  which  Harman's  use  of  structured  vocabulary 
in  grammars  differs  from  Chomsky's  use  in  transformations,  however,  and 
both are of interest in the context of machine-translation and
computational linguistics.

First,  Harman  used  (implicit)  variables  in  the  phrase-structure 
rules  themselves,  and  did  not  try  particularly  to  simulate  transformations 
in  the  way  explained  earlier  (rewriting  permuted  node  strings);  the  tags 
of  Harman's  symbols  are  used  both  to  enforce  agreement,  and  to  collapse 
rules by utilizing rule-features, much like the similar uses by
pre-Chomskian linguists.

Second,  the  notation  scheme  used  by  Harman  was  developed  as  a 
programming  language,  specifically  for  work  in  machine  translation  and 
natural  language  research,  which  suggests  that  the  idea  of  category 
symbols qualified by features immediately appealed to the linguists
who tried to use the early computers (hopelessly short of software) for
natural language processing. In fact, some interesting work was done
in COMIT, and in a version of COMIT coded into Lisp ('METEOR'; see
Bobrow 1964).

4.3 Post-Aspects Use of Complex-Symbol Vocabularies

The next chapter of the story is of unusual interest, because
the  device  rejected  so  strongly  is  made  the  cornerstone  of  the  theory 
of  the  base  component  of  a transformational  grammar  (Chomsky  1965). 

In  Aspects  of  the  Theory  of  Syntax,  Chomsky  takes  the  point  of 
view  that  non-branching  re-write  rules  such  as 


[example rules not recoverable from the source]

are clearly not the correct way to achieve subcategorization. "Although
this  defect  was  pointed  out  quite  early,"  he  says,  "there  was  no  attempt 
to deal with it in most of the published work of the last several
years."  Chomsky  gives  credit  for  the  earliest  recognition  of  this 
fact  to  G.  H.  Matthews,  developer  of  the  COMIT  notation  for  "Extended 
Phrase-Structure Grammars." (Other schemes are given in Bach 1964
and Schachter 1962. A base component very close in structure to a
two-level grammar is proposed in Seuren 1968.)

The  Aspects  theory  of  base  grammars  is  developed  twice.  First, 
a proposal  is  made  to  have  four  kinds  of  rules:  (1)  Context-free 
rewrite  rules  which  develop  the  entire  non-terminal  phrase  structure 
of a phrase-marker; (2) context-free subcategorization rules which
introduce terminals with "inherent features"; (3) context-sensitive strict
subcategorization rules which introduce "contextual features" of the
geometry  of  the  phrase-marker;  (4)  context-sensitive  selectional  rules 
which  introduce  "contextual  features"  of  other  terminals. 

Examples of these four types of rules would be:

(1) CF rewrite: S → NP⌢Predicate-Phrase

(2) CF Subcategorization: [+N] → [±Det __]
                          [+Count] → [±Animate]

(3) CS Strict Subcategorization: [+V] → [+ __ NP] / __ NP

(4) CS Selectional: [+V] → [+[+Abstract]-Subject] / [+N, +Abstract] Aux __


(There  is  a very  strong  resemblance  between  these  last  three  types 
of  rules  which  have  left-sides  meaning  "a  symbol  with  at  least  the 
features [F]", and the COMIT-Harman scheme.)

The obvious redundancy of these last rules is then used to motivate
a proposal to have their effect achieved by conventions on lexical
insertion,  such  that  lexical  items  are  inserted  with  their  inherent 
features  from  a lexicon,  observing  the  constraints  that  items  with 
strict-subcategorization  features  are  inserted  only  into  conforming 
base structures, and items with selectional features are inserted only
into  base  structures  with  conforming  lexical  items. 
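
The two constraints on lexical insertion can be pictured with a small sketch. The Python below is a deliberately crude illustration, not the Aspects formalism: the lexicon entries, the feature names, and the function can_insert are all hypothetical, the "frame" of a verb is reduced to the list of categories that follow it, and the selectional restriction is reduced to a set of features required of the subject noun.

```python
# Hypothetical toy lexicon: nouns carry inherent features.
NOUNS = {
    "sincerity": {"N", "Abstract"},
    "boy": {"N", "Animate"},
}

# Hypothetical verb entries.
#   frame:   categories that must follow the verb (strict subcategorization)
#   subject: features required of the subject noun (selectional restriction)
VERBS = {
    "frighten": {"frame": ("NP",), "subject": set()},
    "elapse": {"frame": (), "subject": {"Abstract"}},
}

def can_insert(verb, following, subject_noun):
    """True if verb may be inserted into a base structure whose verb slot is
    followed by the categories in `following` and whose subject noun is
    `subject_noun`."""
    entry = VERBS[verb]
    # Strict subcategorization: the base structure must conform to the frame.
    if tuple(following) != entry["frame"]:
        return False
    # Selection: the other lexical item must carry the required features.
    return entry["subject"] <= NOUNS[subject_noun]
```

The first test looks only at the geometry of the phrase-marker, the second only at the features of another inserted item, which mirrors the division between rule types (3) and (4) above.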

This  set  of  conventions  is  then  treated  as  defining  "lexical 
insertion transformations", with the observation that subcategorization
is achieved by "local transformations" which only affect a substring
dominated  by  a single  category  symbol,  and  which  (Chomsky  suggests 
in  a note)  may  be  sensitive  to  the  "vertical  context"  of  dominating 
nodes as well as to the "horizontal context" usually employed in
context-sensitive rules.

Rules with the properties of these Chomskian "local transformations"
have recently been studied by Joshi and Levy (1977), who generalize
the  very  satisfying  result  of  Peters  and  Ritchie  (1973)  regarding  context- 
sensitive  rules,  to  the  expected  further  result  that  "local  transformations" 
(context-free  rules  constrained  by  Boolean  combinations  of  proper-analysis 
predicates  and  domination  predicates  — really  quite  a general  definition) 
when  used  as  node-admissibility  conditions  to  constrain  structural 
descriptions,  admit  terminal  strings  which  are  only  context-free  languages. 

Thus,  it  is  reasonable  to  consider  the  entire  Aspects  base  component 
as  specifying  a set  of  derivation-initial  phrase-markers  in  a two-level 
grammar,  where  only  the  hyper-grammar  is  made  explicit  (although  possible 
meta-symbols are characterized by the redundancy rules for features),
and  restrictions  on  the  hyper-symbols  are  stated  as  predicate  hyper- 
symbols. 

Additional  motivation  for  such  a formulation  of  the  base  component 
is afforded by more recent work (Chomsky 1970, 1977, Chomsky and Lasnik
1977)  in  which  a complex-symbol  analysis  of  non-terminal  as  well  as 
terminal  categories  is  used  to  capture  additional  regularities  and 
restrict  the  possible  rules  of  the  base;  some  of  this  is  similar  to  the 
material  discussed  in  section  3.3  above  on  the  X-bar  convention. 

(The complex-symbol notational conventions were originally developed
in connection with phonological rules, and the description of the formalism
in (Chomsky and Halle 1968, pp. 390-399) may be of some use in understanding
how a van Wijngaarden grammar would be employed on sets represented as
strings.)

4.4  Computational  Linguists  and  Structured  Vocabulary  in  Grammars 

Without  having  a good  understanding  of  what  counts  as  structured 
vocabulary  for  grammars,  it  would  be  possible  to  come  to  the  conclusion 
that  computational  linguists  had  made  comparatively  trivial  use  of 
grammars with structured vocabulary; nearly every grammar is written
with  names  for  category  symbols  which  are  related  to  one  another, 
but  the  way  in  which  their  relatedness  is  exploited  by  the  processing 
programs  is  not  as  obvious.  A closer  look,  however,  reveals  that  ideas 
importantly related to grammars with structured vocabulary have been
used  by  some  of  the  most  notable  computational  linguistics  projects, 
often  in  slightly  disguised  form. 

The  general  line  of  development  is  usually  considered  under  the 
heading  of  parsers  for  unrestricted  rewriting  systems  (type-0  grammars 
in  the  terms  of  Chomsky  1959).  Thus,  one  of  the  earliest  significant 
systems of this sort was Yngve's COMIT programming language (Yngve
1961, 1972) designed specifically for research on natural languages and
machine translation, and discussed briefly in section 4.2 above in
connection with Gilbert Harman's use of the notation.

The COMIT language is based upon the format of Markov's "normal
algorithms";  the  individual  statements  in  the  language  are  for  this 
reason  called  "rules",  and  they  operate  by  identifying  a string  and 
re-writing it. COMIT adds labels and transfers to the notation, and the
resulting "labeled Markov algorithms" are sufficiently convenient to be
used for many string-oriented tasks as a general programming language
(see Brainerd and Landweber 1974, Chapter 5).

The  basic  type-0  grammar  format  of  the  COMIT  language  is  in 
principle  sufficient  to  achieve  any  computation,  but  from  its  earliest 
versions an additional mechanism of "subscripts" was used, which amounts
to  a type  of  two-level  grammar. 
Each symbol in a COMIT program may have "subscripts," which may
in turn have values (values are "essentially sub-subscripts" (Yngve
1961, ^ 11121)). The subscripts are something like feature bundles, the
sub-subscripts something like individual features. (Numeric subscripts
are also available, which have somewhat different properties.) COMIT
rules  manipulate  the  symbols  as  basic  constituents,  subject  to  the 
convention  that  re-writing  a symbol  means  including  its  subscripts  on 
the  resulting  symbols;  there  are  a great  many  possible  variations  which 
may  be  stated,  such  as  minimum  or  maximum  sets  of  subscript  features 
necessary  on  the  left-side  symbol  for  the  rule  to  apply,  and  explicitly 
setting,  erasing,  and  merging  subscript  sets. 

Although a COMIT 'grammar' (program) is only a one-level grammar,
its vocabulary is structured in a way which (like a two-level grammar)
makes it possible to abbreviate large sets of rules in schemata.

There is nothing in the theory of Markov algorithms to suggest this
step, so its inclusion must have been prompted by the linguist-designers'
feelings that for natural-language rules the use of systematically
structured symbols would enhance the ease and naturalness of using the
COMIT system. Given the developments in linguistics reviewed in previous
sections, this is not surprising.

There  are  two  principal  paths  of  development  from  this  early 
work  of  Yngve's  (and  of  G.  H.  Matthews',  according  to  Chomsky).  The 
first is the work of pattern-matching, which results in Snobol4 (Griswold
et al. 1971, Gaskins and Gould 1972) and its underlying theory (Gimpel
1973, 1975, 1976). Although related to two-level grammars, this line
will not be followed up here. The other line of work is in type-0
parsers, and here the most influential publication is Martin Kay's
"powerful  parser"  (Kay  1967). 

Kay's  grammars  are  again  not  well  separated  into  levels,  but  they 
contain  rules  such  as: 

SG.1 = NUM(1)

PL.1 = NUM(1)

N.1 NUM.2 V.3 2 = NOUN(1 2) VERB(3 2)

(Since these are recognition rules, Kay writes them backwards: the
left side of the rule is on the right side of the paper, and vice
versa. The numbers after dots assign identifying numbers; the numbers
in parentheses use those identifications to build constituent structure.)
The first two rules here recognise either singular or plural number
as being of category NUM. (The single item, dot one, is made the
sole constituent of a NUM node, parenthesized one.) The third rule
then  recognizes  four  elements:  (1)  a noun  N,  (2)  a number  morpheme  NUM, 
(3) a verb V, and (4) a second number morpheme which is the same (SG
or PL) as item two, and so this fourth item is not assigned a number
of its own. These four items are rewritten as two items (the left side
of  the  rule,  seen  here  on  the  right  side  of  the  paper):  (1)  a noun  phrase 
NOUN  dominating  right-side  items  1 and  2,  and  (2)  a verb  phrase  VERB 
dominating  right-side  items  3 and  4 (3  and  2,  since  4 and  2 are  required 
to be identical). After this rule has applied a possible partial
parse-tree would be

 NOUN        VERB
 /  \        /  \
N   NUM     V   NUM
     |           |
    SG          SG

Agreement  of  the  SG's  during  recognition  is  forced  here,  it  should 
be  noted,  by  the  analogue  of  the  Uniform  Replacement  Convention;  the 
two  instances  of  NUM  can  be  required  to  have  the  same  value  because 
they  are  recognized  in  a single  rule;  and  that  is  why  what  are  essentially 
two  unrelated  context-free  rules 

NOUN → N + NUM
VERB → V + NUM

are  applied  together  in  a single  type-0  rule  application. 
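
The effect of recognizing both number morphemes in one rule can be sketched as follows. The Python below is an illustration of the agreement mechanism only, not Kay's parser or its rule notation: the function recognise and the tuple encoding of nodes as (label, children) are assumptions made for the example.

```python
def recognise(items):
    """Apply the three recognition rules to a list of four items.

    Items are terminal labels such as "N", "SG", "V"; nodes that have been
    built are tuples of the form (label, children)."""
    # Rules 1 and 2: SG or PL alone becomes the sole constituent of a NUM node.
    items = [("NUM", (it,)) if it in ("SG", "PL") else it for it in items]
    # Rule 3: N NUM V NUM, where the fourth item must be identical to the second.
    if len(items) == 4:
        n, num, v, num_again = items
        is_num = isinstance(num, tuple) and num[0] == "NUM"
        if n == "N" and v == "V" and is_num and num_again == num:
            # NOUN dominates items 1 and 2; VERB dominates items 3 and 2.
            return [("NOUN", (n, num)), ("VERB", (v, num))]
    return None
```

Because the identity test on the two NUM nodes sits inside a single rule application, a string such as N SG V PL is rejected without any separate agreement rule.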

Unfortunately  not  too  much  development  of  this  sort  of  use  is 
given,  because  the  "main  concern"  of  Kay's  paper  is  "to  discuss  the 
extent  to  which  the  program  we  have  been  describing  can  be  made  to 
function  as  a transformational  analyzer."  A compulsion  to  use  a type-0 
grammar  to  effect  "transformations"  runs  through  this  whole  line  of 
work.  It  is  exactly  the  same  ability  as  the  use  of  van  Wijngaarden 
grammars  to  effect  "transformations"  which  we  discussed  above  (section 
4.2),  and  unfortunately  it  always  exceeds  the  range  of  manageable 
complexity  rather  quickly. 

Immediate  successors  of  Kay  are  Kaplan's  General  Syntactic  Processor 
(Kaplan  1973)  — Kay's  parser  plus  some  extras  from  Woods's  ATN's  — 
and  the  REL  (Rapidly  Extensible  Language)  System  (Dostert  and  Thompson 
1971,  1972).  The  REL  development  is  of  some  interest,  because  to  the 
original  concept  of  Kay's  type-0  parser  has  been  added  a layer  of 
features,  so  that  the  resulting  grammar  has  the  same  structure  as  a COMIT 
program. 

Coverage  of  a large  part  of  English  is  claimed,  with  only  300 
rules  in  the  REL  English  grammar.  These  are  really  rule  schemata,  like 
hyper-rules,  since  a rule  has  the  form: 

VP' → NP VP

FEATURE CHECK: VP must be (-Passive) (-Subject) and (-Agentive).

FEATURE SET: Assign (+Subject) and (+Agentive) to VP' together
with the features of VP.

This  summarizes  (as  we  may  reinterpret  the  formalism)  all  the  rules 
(production-rules)  in  which  categories  representing  all  kinds  of  feature- 
bundles participate, so long as the feature-bundles of VP' and VP are
related as "dynamically" specified in the final condition, and so long
as the feature-bundle of the VP has the specified values. Hidden
under the procedural terminology of "check" and "set" or "assign"
we  discern  a hyper-grammar  with  hyper-rules  such  as 

<vp-prime VFEATURES1> : <np NFEATURES> <vp VFEATURES2>

    <where VFEATURES2 includes (-pas,-subj,-agt)>

    <where VFEATURES1 is VFEATURES2 but (+subj,+agt)>

In  this  version  the  "feature  check"  and  "feature  set"  actions  of  the 
REL  rule  have  become  the  two  predicate  hyper-notions.  It  is  not  too 
hard  to  see  how  to  define  predicates  like  these  syntactically,  although 
in  an  implementation  they  would  of  course  be  implemented  just  as  the 
"check"  and  "set"  actions  are  implemented  in  REL. 

The  importance  of  the  features,  or  rather  the  inadequacy  of  a 
grammar  of  300  rules,  is  not  easy  to  overestimate.  The  example  sentence 
"Has  John  attended  the  school  of  Cambridge's  Mayor?"  is  said  to  parse 
unambiguously in REL English with features, but to be 2,701 ways ambiguous
in  the  same  grammar  without  features  (Dostert  and  Thompson  1972).  In 
REL  English  the  subscript  features  are  said  to  have  three  roles:  (1)  to 
subcategorize  parts  of  speech;  (2)  to  prevent  ungrammatical  strings 
(i.e., to collapse nearly-identical rules); and (3) to determine the
preferred order of syntactic groupings (such as preventing multiple
ambiguities in strings of nouns or adjectives, a motivation analogous
to that for Harris's superscript numbers).

The  REL  grammar  is  rather  easy  to  see  as  a very  large  context- 
free  grammar  abbreviated  in  a way  somewhat  like  a van  Wijngaarden  grammar; 
the  "category  symbols"  provide  a gross  check  on  the  applicability  of  a 
rule  during  parsing,  and  only  if  that  test  is  passed  is  it  necessary  to 
check  to  see  if  a detailed  rule  which  is  a refinement  of  the  gross  form 
is  actually  applicable.  (And,  the  detailed  rules  do  not  physically 
exist,  but  are  made  up  on  the  fly  from  the  abbreviatory  schemata  plus 
the  features  actually  discovered  to  be  present.) 
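
The two-stage test described here (category symbols first, features second) can be sketched for the single rule VP' → NP VP given above. The Python below is an assumption-laden illustration, not REL code: a feature bundle is modeled as the set of features marked "+", so that a feature absent from the set counts as "-", and the names Node, BLOCKED, ADDED and np_vp_rule are hypothetical.

```python
from typing import FrozenSet, Optional, Tuple

# A node is (category, set of features marked "+"); absence means "-".
Node = Tuple[str, FrozenSet[str]]

BLOCKED = frozenset({"Passive", "Subject", "Agentive"})  # FEATURE CHECK
ADDED = frozenset({"Subject", "Agentive"})               # FEATURE SET

def np_vp_rule(left: Node, right: Node) -> Optional[Node]:
    """Apply the schema VP' -> NP VP to two adjacent constituents."""
    # Gross check on the category symbols alone.
    if left[0] != "NP" or right[0] != "VP":
        return None
    vp_features = right[1]
    # FEATURE CHECK: VP must be -Passive, -Subject and -Agentive.
    if vp_features & BLOCKED:
        return None
    # FEATURE SET: VP' gets the features of VP plus +Subject and +Agentive.
    return ("VP'", vp_features | ADDED)
```

No detailed rule is stored anywhere; each call in effect constructs the one refinement of the schema that matches the feature bundles actually found.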

This  same  form  of  grammar  has  been  used  in  several  other  recent 
computational  linguistics  projects,  but  concealed  still  further  in 
"procedural" terminology, and making use of program fragments instead
of  a more  abstractly-defined  data  base  of  rules.  This  is  the  work  of 
Woods,  of  Sager,  and  of  Winograd. 

It  is  perhaps  no  accident  that  each  of  these  three  approaches 
is somewhat "an implementation in search of a theory." The engineering
approach  is  to  write  a program  and  work  out  any  problems  as  they  come  up, 
and  the  linguistic  engineers  of  early  machine  translation  days  fell 
into  the  same  error  (see,  e.g.,  Garvin  1966  for  a defense  of  the  practice) 
of  encoding  their  grammars  as  recognition  procedures.  It  is  undeniably 
odd,  however,  that  such  a practice  should  have  persisted  up  to  the  present. 
Outside  the  circles  of  the  artificial  intelligentsia,  the  current  view 
is rather more typified by Grishman's remarks that "The 'grammar in
program' approach which characterized many of the early machine
translation efforts is still employed in some of today's systems." "...research
goals should be the ability to manage grammatical complexity and the
ability to communicate successful methods to others. In both these
regards, a syntactic analyzer using a unified, semi-formal set of rules
is bound to be more effective" (Grishman 1976). Today's systems, it
should  be  noted,  are  more  apt  to  have  some  set  of  formal  rules,  but  then 
to  compromise  this  by  inserting  in  the  rules  the  names  or  addresses  of 
arbitrary  bits  of  program  to  carry  out  procedures  — thus  effectively 
putting  an  essential  part  of  the  grammar  into  programs,  which  are  hard 
to  control. 

Woods's  Augmented  Transition  Networks  (Woods  1969,  1970,  1975) 
are  the  best-known  example  of  such  a procedural  way  of  analyzing  natural 
language.  They  actually  represent  a use  of  structured  vocabulary,  although 
because  of  the  organization  of  the  systems  they  sometimes  give  the  impression 
of  being  completely  ad-hoc  recognition  systems.  They  are,  in  the  usual 
sincere flattery claimed for this kind of work, "capable of doing
everything  that  a transformational  grammar  can  do"  (Woods  1970),  in  the 
usual  uninteresting  sense.  In  addition  to  the  use  of  structured  vocabulary 
which  we  shall  attempt  to  identify,  the  ATN  grammars  also  model  themselves 
after  the  special  factored  form  of  a "regular  right-part  grammar"  explained 
in section 2.4. They further confuse matters by interspersing actions
to build a tree structure which is distinct from the structural description
of the string analysed; this is of no interest here, being merely an
ill-structured translation of a context-free grammar made at an
inconvenient time.

In  a Woods  ATN,  the  structured  vocabulary  of  the  grammar  is  nowhere 
explicit,  but  is  held  in  the  state  of  various  (software)  registers  over 
time.  When  a subject  NP  is  recognized,  the  features  of  its  head  noun 
are  put  into  a special  "subject  register"  by  an  "action";  then,  later, 
when  a VP  is  at  hand,  the  subject  register  is  interrogated  by  a "condition" 
which  checks  compatibility. 
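
The register mechanism can be pictured with a very small sketch. The Python below is not Woods's notation or any actual ATN implementation: the lexicon, the state names, the register names and the function parse_clause are all hypothetical, and the network is reduced to a noun followed by a verb. It shows only how an "action" stores features in a register and a later "condition" interrogates it.

```python
# Hypothetical toy lexicon with features attached to words.
LEXICON = {
    "dog": {"cat": "N", "number": "SG"},
    "dogs": {"cat": "N", "number": "PL"},
    "barks": {"cat": "V", "number": "SG"},
    "bark": {"cat": "V", "number": "PL"},
}

def parse_clause(words):
    """Walk a two-arc network; return the registers, or None on failure."""
    registers = {}
    state = "S/"
    for word in words:
        entry = LEXICON[word]
        if state == "S/" and entry["cat"] == "N":
            # Action: put the head noun's features in the subject register.
            registers["SUBJ"] = entry
            state = "S/NP"
        elif state == "S/NP" and entry["cat"] == "V":
            # Condition: interrogate the subject register for compatibility.
            if registers["SUBJ"]["number"] != entry["number"]:
                return None
            registers["VERB"] = entry
            state = "S/VP"
        else:
            return None
    return registers if state == "S/VP" else None
```

The agreement feature never appears in a category symbol; it lives only in the register between the two arcs, which is the sense in which the structured vocabulary is held in the state of the registers over time.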

This  is  obviously  one  possible  low-level  implementation  of  a 
two-level  grammar  — although  Woods's  grammars  have  been  naive  in  details 
such  as  providing  a limited  set  of  grammatical  relations  to  program  into 
registers,  and  in  attaching  features  to  words  alone  so  that  other  actions 
must  "look  inside"  larger  constituents  to  find  the  features  (Burton  and 
Woods  1976). 

It would perhaps be worth exploring the application of a more
abstract van Wijngaarden approach to ATN's (introducing many states
with structured names, and so forth, while retaining the
regular-right-part format) to see whether those who like ATN's would like the
resulting version. Such a development would be merely a notational variant of the
restricted  grammars  with  structured  vocabulary  to  be  introduced  in 
following  sections,  and  would  remove  the  procedural  flavor  from  the 
definition  of  conditions  and  actions,  while  retaining  it  for  the  basic 
recursive  networks. 

A similar  project  using  similar  means  is  the  "Linguistic  String 
Project"  of  Sager,  which  is  influenced  by  Harris's  String  Analysis 
notions (Harris 1962). As Sager and Grishman observe, "This basic
strategy  of  grammar  design,  in  which  a context-free  framework  is  augmented 
by  a set  of  conditions  written  as  procedures,  has  become  quite  popular; 
it  is  used,  for  example,  in  the  systems  of  Woods  and  Winograd"  as  well 
as in their own Linguistic String Project system (Sager and Grishman
1975).  Their  implementation  again  employs  "registers"  which  are  set 
and  checked,  and  if  anything  it  is  less  constrained  than  Woods's  systems 
because they have invented a "restriction language" in which to program
the  conditions. 

The  last  of  these  implementations  which  we  shall  mention  is  that 
of  Terry  Winograd.  Winograd  is  much  influenced  by  the  idea  of  the 
compactness  in  a structured  vocabulary  built  on  category  symbols  plus 
features: 

We  allow  each  symbol  to  have  additional  subscripts,  or 
features,  which  control  its  expansion.  In  a way,  this 
is  like  the  separation  of  the  symbol  NP  into  NP/PL  and 
NP/SG  in  our  augmented  context-free  grammar.  But  it  is 
not  necessary  to  develop  whole  new  sets  of  symbols  with 
set  of  expansions  for  each.  A symbol  such  as  CLAUSE  may 
be  associated  with  a whole  set  of  features  (such  as 
TRANSITIVE,  QUESTION,  SUBJUNCTIVE,  OBJECT-QUESTION, 
etc.)  but  there  is  a single  set  of  rules  for  expanding 
CLAUSE.  These  rules  may  at  various  points  depend  on  the 
set  of  features  present.  (Winograd  1971) 

This is not a bad description of the practical advantages of structured
vocabulary and rule schemata, as we have described them previously.
Unfortunately, Winograd somehow came to believe that M. A. K.
Halliday is the only professional linguist who shares this appreciation
(probably because only there did he see explicitly written features,
outside of Chomsky), and so Winograd develops his own grammar in the
form of a program working on Halliday's hints about "systemic grammar"
(Halliday 1961). Winograd rapidly programs himself into an ad-hoc mess:

How,  for  example,  can  we  handle  agreement?  One  way  to  do  this 
would be for the VP program to look back in the sentence
for  the  subject,  and  check  its  agreement  with  the  verb 
before  going  on.  We  need  a way  to  climb  around  on  the 
parsing  tree,  looking  at  its  structure. 

...The  call  (*C  DLC  PV  (NP))  will  start  at  the  current 
node,  move  down  to  the  rightmost  completed  node  (i.e., 
not  currently  active)  then  move  left  until  it  finds  a 
node with the feature NP (Down-Last-Completed, Previous)....

When this idea is elaborated over several years, the result is a hacker's
dream. This is precisely the sort of approach to the advantages of a
structured vocabulary from which we are saved by the invention of a
two-level van Wijngaarden grammar.

5. Restrictions on Grammars with Structured Vocabulary

We  have  seen  in  section  2 above  that  context-free  grammars  of 
the  classic  one-level  type  are  in  practice  unwieldy  for  describing 
natural  languages.  In  section  3,  we  saw  that  by  introducing  a structured 
vocabulary  in  a van  Wijngaarden  "two-level"  grammar,  it  was  possible  to 
overcome  some  of  the  practical  difficulties,  but  that  the  resulting 
class  of  grammars  contains  formidably  complex  systems,  equivalent  to 
unrestricted rewriting systems. Beyond the matter of theoretical
power, there is the practical question of how to apportion function
for maximum insight in a two-level grammar.

In  the  review  of  prior  linguistic  uses  of  structured  vocabulary 
in  section  4,  we  have  seen  that  there  is  a strong  tradition  going  back 
to pre-Chomskyan linguistics to work with basic units of syntactic
categories, qualified by the addition of features (tags, subscripts)
to provide for agreement or co-occurrence restrictions and to permit
rule features to collapse nearly identical rules. We saw that this
tradition  was  continued  without  question  by  early  implementors  of 
systems  and  languages  for  natural  language  processing;  and  that  because 
the  same  formal  devices  could  be  used  to  encode  "transformations"  or 
tree-manipulation  rules,  that  purpose  was  added  to  the  traditional  uses 
of  structured  vocabulary  by  some  computational  linguists.  It  also 
appeared  that  these  last  extensions  have  been  counterproductive,  and 
that  the  two  purposes  of  abbreviating  a large  context-free  grammar  and 
of  transforming  structural  descriptions  should  be  conceptually  separated. 

Accordingly,  we  examine  in  this  section  restrictions  designed  to 
model  a grammar  with  category  symbols  and  features,  as  a restriction 
on the form of van Wijngaarden grammars. These restrictions are not
introduced to alter the weak generative capacity of the grammars (the
restricted grammars are still type-0 grammars), but they do restrict
grammars to a class which is easier to write and easier to parse.
Moreover, by restricting the format of grammars it is possible that the
notation can be made less cumbersome.

The  restrictions  proposed  here  are  somewhat  similar  to  those  in 
(Greibach 1974), which reduce the generative capacity of the grammars
to a sub-class of context-sensitive languages and yield other pleasant
theoretical  properties;  but  here  our  interest  is  exclusively  in  the 
practical  ease  and  naturalness  of  the  grammars  when  written  by  people 
to  describe  natural  languages. 

It  is  likely  that  grammars  of  the  class  described  in  this  section 
can  be  written  which  would  be  adequate  to  serve  as  recognition  grammars 
for  parsing  natural  languages,  in  such  practical  applications  as  machine 
translation.  These  grammars  do  not  effect  any  inter-language  transformation 
or  correspondence,  which  would  be  left  to  a separate  component. 

5.1 A Restricted van Wijngaarden Grammar

In this section we will introduce a simple van Wijngaarden grammar
to define an artificial language; its interest lies in the fact that
the grammar is constructed using a restricted part of the potential
techniques available in a van Wijngaarden grammar. Generally, we mean
to restrict every non-predicate hyper-symbol to be a single proto-symbol
(the "category symbol") and a string of meta-symbols (the "features"),
and further to require that each proto-symbol is always accompanied by
the same set of meta-symbols. Only predicate hyper-symbols (those which
dominate only EMPTY or blind alleys) are not restricted in this way.

The  style  of  grammar  which  results  from  these  restrictions  will  be  shown 
in  the  following  sections,  where  alternative  notations  for  such  restricted 
grammars  will  be  shown  and  exemplified  by  transcribing  the  same  grammar 
into  them.  Following  these  samples,  a more  complex  grammar  related  to 
natural  languages  (though  still  simplified  for  exposition)  will  be 
exhibited  in  all  three  notations. 

Suppose  we  wish  to  define  a language  in  which  names  may  be  "declared" 
and  "used",  in  a way  much  like  in  ordinary  programming  languages.  (This 
example  language  is  modeled  after  that  of  (Watt  1977),  and  is  related 
to  the  larger  example  to  follow  which  concerns  the  proper  characterization 
of quantified logical formulas.) In this language every name must be
declared (as "new information") before it is used (as "old information"),
and no name may be declared more than once. (Names may be declared
without being used, however, and may be used more than once after being
declared.) This language resembles a programming language with
semi-strict declarations and without block structure.

For example, good strings in this language are ones such as

    dcl x use x end

in which x is declared (dcl) and then used (use), and other good strings
would be:

    dcl x dcl y use y use x end
    dcl z dcl x dcl y use x use z use z end
    dcl x use x use x use x end

But this language does not include strings such as the following:

    *dcl x use x use y end    (no declaration for y before use)
    *dcl x dcl x use x end    (x declared twice)
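
The membership conditions illustrated by these strings can also be stated procedurally. The following Python sketch is offered only as an aid to intuition and is not one of the grammar formalisms under discussion; the function name `in_language` and the whitespace-separated token format (with the keywords written `dcl`, `use`, `end`) are assumptions made for the illustration.

```python
NAMES = {"x", "y", "z"}

def in_language(text):
    """Return True if text is a well-formed string of the dcl/use language."""
    tokens = text.split()
    if not tokens or tokens[-1] != "end":
        return False
    body = tokens[:-1]
    if len(body) % 2 != 0:          # every keyword must be followed by a name
        return False
    declared = set()
    seen_use = False
    for keyword, name in zip(body[0::2], body[1::2]):
        if name not in NAMES:
            return False
        if keyword == "dcl":
            # declarations come first, and no name is declared twice
            if seen_use or name in declared:
                return False
            declared.add(name)
        elif keyword == "use":
            # a name may be used only after it has been declared
            if name not in declared:
                return False
            seen_use = True
        else:
            return False
    return bool(declared)           # at least one declaration is required
```

A procedure of this kind decides membership, but unlike a grammar it assigns no structural description to the strings it accepts.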

We  will  be  employing  only  the  names  x,  y,  and  z,  so  special  ad-hoc 
methods  could  be  used  to  define  this  language;  however,  it  is  a well- 
known  theorem  that  in  general  languages  like  this  one  are  not  context- 
free  languages  (have  no  grammars  which  are  context-free  grammars),  so 
this  will  serve  as  a sample  of  a language  which  has  no  context-free 
grammar. 

A context-free  grammar  which  gives  strings  of  the  correct  general 
form  is  the  following,  which  generates  "programs"  composed  of  a "list" 
of  declarations  followed  by  a "sequence"  of  uses: 

    program → list seq end
    list → dcl name | list dcl name
    seq → λ | seq use name
    name → x | y | z

(As before, lower-case lambda (λ) is the empty string.) A typical
structural description derived in this grammar would be:

[Tree diagram: structural description rooted in 'program', over the
terminal string]

    dcl x dcl x use y end


Of  course,  as  this  sample  indicates,  the  restrictions  on  declaration 
and  use  are  not  observed  in  this  grammar,  so  that  the  tree  above  is  generated 
by  the  grammar,  but  the  terminal  string  is  not  one  which  we  wished  to  be 
in  the  language  (x  is  declared  twice  and  y is  used  without  declaration). 
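
The over-generation can be checked mechanically. The four context-free rules generate exactly the strings of the shape (dcl name)+ (use name)* end, so for this small illustration a regular expression suffices. The sketch below is in Python; the name `has_context_free_shape` is hypothetical.

```python
import re

# One or more declarations, zero or more uses, then 'end':
# the shape generated by the four context-free rules.
SHAPE = re.compile(r"(?:dcl [xyz] )+(?:use [xyz] )*end")

def has_context_free_shape(text):
    """True if text is generated by the context-free grammar alone."""
    normalized = " ".join(text.split())   # collapse runs of whitespace
    return SHAPE.fullmatch(normalized) is not None

# The unwanted string is accepted: the shape is right,
# but the restrictions on declaration and use are ignored.
overgenerated = has_context_free_shape("dcl x dcl x use y end")
```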

We  can  correct  this  flaw  by  employing  a two-level  van  Wijngaarden  grammar 
in  which  the  list  of  declarations  and  the  sequence  of  uses  are  constrained 
to  be  the  same  members,  after  which  the  declarations  are  peeled  off 
one  by  one  (destructively)  and  the  uses  are  checked  for  membership.  Such 
a grammar  is  the  following: 

Meta-rules:

    m1. NAME  :: → x | y | z
    m2. SET   :: → NAME | SET NAME
    m3. EMPTY :: → λ

Restricted Hyper-rules:

    h1. <program> : → <list SET> <seq SET> <end>
    h2. <list SET> : → <dcl> <name NAME>
                         <where NAME is not in EMPTY>
                         <where SET is union of EMPTY NAME> |
                       <list SET1> <dcl> <name NAME>
                         <where NAME is not in SET1>
                         <where SET is union of SET1 NAME>
    h3. <seq SET> : → <EMPTY> |
                      <seq SET> <use> <name NAME>
                        <where NAME is in SET>
    h4. <name NAME> : → <NAME terminal>

Additional meta-rules and hyper-rules to define predicate hyper-symbols
(first-time readers may skip the remaining rules):

    m4. OPTIONAL :: → NAME | EMPTY

    h5.  <where NAME1 is not in NAME2 SET> : →
             <where NAME1 is not NAME2>
             <where NAME1 is not in SET>
    h6.  <where NAME1 is not in NAME2> : →
             <where NAME1 is not NAME2>
    h7.  <where NAME1 is not NAME2> : →
             <where NAME1 precedes NAME2 in xyz>
           | <where NAME2 precedes NAME1 in xyz>
    h8.  <where NAME1 precedes NAME2 in OPTIONAL1 NAME1 OPTIONAL2
          NAME2 OPTIONAL3> : →
             <EMPTY>
    h9.  <where NAME is not in EMPTY> : →
             <EMPTY>
    h10. <where NAME1 is in NAME2 SET> : →
             <where NAME1 is NAME2>
           | <where NAME1 is in SET>
    h11. <where NAME1 is in NAME2> : →
             <where NAME1 is NAME2>
    h12. <where NAME is NAME> : →
             <EMPTY>
    h13. <where EMPTY is in EMPTY> : →
             <EMPTY>
    h14. <where SET1 is union of SET2> : →
             <where SET1 is subset of SET2>
             <where SET2 is subset of SET1>
    h15. <where SET1 NAME is subset of SET2> : →
             <where NAME is in SET2>
             <where SET1 is subset of SET2>
    h16. <where NAME is subset of SET> : →
             <where NAME is in SET>
    h17. <where EMPTY is union of EMPTY> : →
             <EMPTY>


This van Wijngaarden grammar generates exactly the language specified,
with all restrictions observed. (The detailed syntax of the predicate
hyper-symbols in rules h5 to h17 will not be discussed here, but generally
follows the pattern introduced in section 3.5 above, with which comparison
would be useful.)

A sample structural description in this grammar would be:

[Tree diagram: structural description rooted in <program>, showing nodes
such as <seq xy>, <use>, <name y>, <x terminal>, <y terminal> and <EMPTY>,
over the terminal string]

    dcl y dcl x use y use x use x use y end

(This grammar, for simplicity, uses the convention that either the
underlined symbols or a production-symbol ending in 'terminal' are
terminals.) The predicate hyper-symbols in the hyper-rules enforce that
only  one  declaration  per  name  can  take  place  in  the  left  branch  and  that 
names  used  on  the  right  are  in  the  declared  set;  but  observe  that  the 
extensive  sub-trees  dominating  only  EMPTY  and  headed  by  predicate  symbols 
have been suppressed in the tree above. A fuller diagram would be much
more complicated at every level; a sample level would be:

[Tree diagram: one level of the full structural description, including
the sub-trees headed by predicate symbols]


This shows all the rule applications which had to be possible in order
for the rule rewriting <list xy> as <list y> <dcl> <name x> <where
x is not in y> <where xy is union of yx> to be part of a valid derivation.
The last two hyper-symbols (the two predicates) and all their dominated
material were removed from the structural description because they dominated
only EMPTY terminals. Trees like this are generated at each declaration,
and a smaller tree is generated at each use. If the sample derivation had
not been correct, then some of the predicate nodes would not have been able
to generate only EMPTY, because the derivation would have blocked at a
non-terminal which could not be re-written.

It  should  be  clear,  looking  at  this  grammar,  how  proto-symbols 
such  as  'list'  and  'seq'  are  used  as  category  symbols,  while  meta-symbols 
such as 'SET' and 'NAME' are used as features. The production-rules
generated  from  the  basic  hyper-rules  will  re-write  a category  symbol  and 
any  possible  feature  set,  leaving  it  to  the  rules  re-writing  predicate 
hyper-symbols  to  block  the  derivations  containing  features  which  are  not 
correctly  arranged. 
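
The way in which a predicate hyper-symbol either derives EMPTY or blocks can be imitated directly. The following Python sketch models the predicates "is not" and "is not in" in the manner of the hyper-rules that define inequality through precedence in the string xyz; the function names are hypothetical, and a Boolean result stands in for the existence of a derivation ending in EMPTY.

```python
from itertools import product

NAMES = ("x", "y", "z")
OPTIONAL = NAMES + ("",)        # an OPTIONAL is a NAME or EMPTY

def precedes(name1, name2, alphabet="xyz"):
    """Succeed if alphabet = OPTIONAL1 name1 OPTIONAL2 name2 OPTIONAL3."""
    return any(o1 + name1 + o2 + name2 + o3 == alphabet
               for o1, o2, o3 in product(OPTIONAL, repeat=3))

def is_not(name1, name2):
    """Two names differ if one of them precedes the other in xyz."""
    return precedes(name1, name2) or precedes(name2, name1)

def not_in(name, names):
    """name differs from every member of the string names (vacuously so for EMPTY)."""
    return all(is_not(name, other) for other in names)
```

No primitive test of inequality is needed: the uniform replacement of NAME1 and NAME2 inside the fixed string xyz does all the work, which is why such predicates can be written with the grammar's own resources.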

5.2 Koster's Affix Grammars

In  the  preceding  section  we  exhibited  a correctly-matched  structural 
description  and  associated  string  in  the  language  of  the  sample  van 
Wijngaarden  grammar.  But  it  is  a nice  question  how  we  could  have  started 
with the string and the grammar, and discovered the structural description
(if  any).  We  could  not,  for  instance,  have  expanded  the  van  Wijngaarden 
rules  out  to  their  equivalent  production  rules,  since  there  are  an 
unbounded  number  of  production  rules  produced  by  the  first  hyper-rule  alone: 

    SET :: → NAME | SET NAME

    <program> : → <list SET> <seq SET> <end>

Any  value  of  SET  can  be  used  to  form  a production  rule,  even  though  not 
all  sets  initiate  valid  derivations  (for  example,  those  with  repeated 
instances  of  the  same  name  do  not). 
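
To see why exhaustive expansion is hopeless, the instances of the first hyper-rule can be enumerated lazily. This Python sketch assumes the three names x, y and z and writes each production rule as a plain string; the generator names are hypothetical.

```python
from itertools import count, islice, product

NAMES = "xyz"

def set_values():
    """Yield every value of the meta-symbol SET: all non-empty strings of names."""
    for length in count(1):                      # lengths 1, 2, 3, ... without bound
        for combo in product(NAMES, repeat=length):
            yield "".join(combo)

def h1_instances():
    """Yield the production rules obtained from h1 by uniform replacement of SET."""
    for value in set_values():
        yield f"<program> -> <list {value}> <seq {value}> <end>"

# Only a finite prefix can ever be listed.
first = list(islice(h1_instances(), 5))
```

The fourth rule listed, with SET replaced by xx, is a legitimate production rule even though it can initiate no valid derivation.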

Clearly we need to begin the other way, and to find some method
to discover from a string to be parsed what rules are relevant to it,
and then see if those rules exist. One proposed way to do this is to
re-formulate a van Wijngaarden grammar as a different but related kind of
"affix grammar" specifically designed to make this strategy trivial
(Koster 1971, 1974b, 1975, Crowe 1972, Watt 1977). This must be done by
hand, since not every van Wijngaarden grammar can be so re-written,
and the conversion is not mechanical. But we will examine affix grammars
here, anyway, since they employ a notation which is related to our restricted
van Wijngaarden grammars and since they do suggest intuition as to how
parsing could proceed.

An affix grammar can be seen as a restricted van Wijngaarden grammar
which (like those of the preceding section) contains only non-predicate
hyper-symbols which consist of a single proto-symbol and a string of
meta-symbols, always the same string for any single proto-symbol. But
in addition, for each appearance of any meta-symbol in a rule the grammar
writer must specify whether it is "inherited" or "synthesized"; that is,
whether its value in that appearance depends on the values of symbols in
its constituents alone (synthesized), or whether its value depends in
part on the context of its use (inherited). Finally, predicates are
defined in non-syntactic ways (for convenience), and may compute the
values of some meta-symbols. Since the meta-symbols in this sort of grammar
so clearly qualify the proto-symbols, they are called "affixes".

As an example, we give here an affix grammar for the same language
defined in the previous section. The new notation here consists of a
rising arrow ↑ to precede synthesized affixes, and a down arrow ↓ to
precede inherited affixes. This definition can be compared rule for
rule with the van Wijngaarden grammar given before.

    m1. NAME  :: → x | y | z
    m2. SET   :: → NAME | SET NAME
    m3. EMPTY :: → λ

(affix-style hyper-rules:)

    h1. <program> : → <list ↑SET> <seq ↓SET> <end>
    h2. <list ↑SET> : → <dcl> <name ↑NAME>
                          <add(EMPTY, NAME) return(SET)> |
                        <list ↑SET1> <dcl> <name ↑NAME>
                          <add(SET1, NAME) return(SET)>
    h3. <seq ↓SET> : → <EMPTY> |
                       <seq ↓SET> <use> <name ↑NAME>
                         <identify(SET, NAME)>
    h4. <name ↑NAME> : → <↑NAME terminal>

Here "add" and "identify" are names of predicates; they are defined
procedurally, by giving the ranges of their parameters and a specification
of their actions:

Predicates:

    name      input        result       function
              parameters   parameters
    add       1. SET1      3. SET       if NAME ∈ SET1
              2. NAME                   then fail
                                        else SET ← SET1 ∪ {NAME}
                                        fi
    identify  1. SET                    if NAME ∈ SET
              2. NAME                   then succeed
                                        else fail
                                        fi
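
The two predicates translate almost word for word into a programming language. In the Python sketch below, failure is modeled by raising an exception; the exception class `PredicateFailure` and the use of frozensets for affix values are assumptions of the illustration.

```python
class PredicateFailure(Exception):
    """Raised when a predicate fails, which blocks the current parse."""

def add(set1, name):
    """Return set1 with name added; fail if name is already a member."""
    if name in set1:
        raise PredicateFailure(f"{name} is declared twice")
    return set1 | {name}

def identify(names, name):
    """Succeed silently if name is a member of names; otherwise fail."""
    if name not in names:
        raise PredicateFailure(f"{name} is used without declaration")

# Gathering the declarations of 'dcl x dcl y', then checking 'use y'.
declared = add(add(frozenset(), "x"), "y")
identify(declared, "y")
```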

It will be observed that the basic rules here are typographically
almost identical to the corresponding rules of the van Wijngaarden
grammar,  but  these  rules  contain  extra  information. 

Interpreted  as  a van  Wijngaarden  grammar,  this  affix  grammar 
defines  the  same  language  in  exactly  the  same  way  as  before  — that  is, 
the  language  generated  by  production-rules  which  can  be  obtained  from 
the  affix-grammar  rules  by  the  Uniform  Replacement  Convention.  But 
hidden in the arrows is a further assertion that a possible parsing
strategy would be to construct a parse tree according to the proto-symbols
alone, and then check the affixes in the order indicated by the arrows,
letting the ↑ affixes move up the tree, while the ↓ affixes move down.

For example, the parse tree corresponding to an example similar to the
one given before would be:

[Parse tree diagram, with dotted arrows tracing the flow of affix values]

In this diagram, the dotted arrows running through the affixes of every
node describe a feasible pattern of checking which would be effective;
beginning with the terminal names, values are synthesized upward and
inherited back down, computing the value of all the affixes and checking
their compatibility. In general, of course, the declarations are
gathered moving up the left side of the tree, and then uses are checked
moving down the right side of the tree.
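
The two-phase strategy can be sketched in a few lines of Python. The first function recovers the context-free skeleton from the proto-symbols alone; the second then evaluates the affixes, synthesizing SET while moving up the declarations and inheriting it while moving down the uses. The function names and the flat list representation of the two branches are assumptions made for brevity, not part of the affix-grammar formalism.

```python
NAMES = {"x", "y", "z"}

def skeleton(text):
    """Phase 1: analyse a string by proto-symbols alone, ignoring affixes."""
    tokens = text.split()
    if not tokens or tokens[-1] != "end":
        raise ValueError("a program must finish with 'end'")
    keywords, names = tokens[0:-1:2], tokens[1:-1:2]
    if len(keywords) != len(names):
        raise ValueError("keyword without a name")
    dcls, uses = [], []
    for keyword, name in zip(keywords, names):
        if name not in NAMES:
            raise ValueError(f"unknown name {name!r}")
        if keyword == "dcl" and not uses:     # the list branch comes first
            dcls.append(name)
        elif keyword == "use" and dcls:       # then the seq branch
            uses.append(name)
        else:
            raise ValueError(f"misplaced keyword {keyword!r}")
    if not dcls:
        raise ValueError("at least one declaration is required")
    return dcls, uses

def check_affixes(dcls, uses):
    """Phase 2: synthesize SET up the list branch, inherit it down the seq branch."""
    declared = frozenset()
    for name in dcls:
        if name in declared:                  # the predicate 'add' fails
            return False
        declared = declared | {name}
    return all(name in declared for name in uses)   # the predicate 'identify'
```

A string is in the language when `skeleton` succeeds and `check_affixes` returns true for its result; the string dcl x dcl x use y end passes the first phase and is rejected only by the second.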

Without  going  too  deeply  into  the  restriction  philosophy,  we 
can  distinguish  in  any  hyper-rule  affixes  which  are  in  "defining  occurrences" 
from those which are in "applied occurrences." Defining occurrences are
those  of  (i)  inherited  affixes  on  the  left  side  of  a rule,  or  (ii) 
synthesized  affixes  on  the  right  side  of  a rule.  Applied  occurrences 
are  just  the  opposite:  (i)  synthesized  affixes  on  the  left  side,  or 
(ii) inherited affixes on the right side. For example, in

    <list ↑SET> : → <dcl> <name ↑NAME>
                    <add(EMPTY, NAME) return(SET)>

the synthesized affix ↑NAME is "defining" on the right, while the
synthesized affix ↑SET is "applied" on the left; this corresponds to
passing  information  up  the  tree  while  parsing,  and  the  predicate  "add" 
computes  a value  for  SET  using  the  value  of  NAME. 

Affix  grammars  in  this  form  are  usually  subject  to  some  additional 
constraints,  because  they  have  been  defined  for  programming  languages 
where  the  goal  is  to  parse  very  rapidly,  and  the  constraints  make  it 
possible  to  parse  in  one  pass  over  the  input  string  left  to  right, 
propagating  affixes  through  the  tree  in  the  same  pass  as  parsing  is 
completed.  We  will  not  explore  the  effect  of  these  constraints  here, 
because  for  natural  language  analysis  a looser  set  of  constraints  is 
inevitable  — and  that  is  the  subject  of  the  following  section. 

What  is  notable,  however,  is  that  the  restriction  of  van  Wijngaarden 
grammars  to  be  essentially  context-free  grammars  with  features  and  tests 
rapidly  moves  them  quite  far  down  the  scale  of  generality,  so  that 
convenient  parsing  algorithms  become  available. 


And, as we might expect, the resulting grammars are also easier for
humans to read and understand. Simonet (1977) suggests defining programming
languages by van Wijngaarden grammars, but restricted van Wijngaarden
grammars modeled after affix grammars for ease of human use. That
suggestion  seems  plausible,  given  the  widespread  popularity  of  the  same 
representation  for  natural  language  analysis. 

5.3 Knuth's Attribute Grammars

Still another variant of a grammar with structured vocabulary
is the sort introduced by Donald Knuth and now generally referred to
as an "attribute grammar" (Knuth 1968, 1971, Fang 1972, Wilner 1972).
Although originally motivated by rather different goals, attribute
grammars may for our purposes be considered as simply a set of notational
proposals for affix grammars or van Wijngaarden grammars.

The  advantage  of  the  Knuth  approach  is  that  it  is  once  again 
quite  easy  to  achieve  an  abstract,  "declarative"  way  of  looking  at 
the grammar. In a van Wijngaarden grammar, the declarative view is
obvious:  a class  of  well-formed  strings  and  their  associated  structural 
descriptions  is  characterized,  but  no  procedure  for  parsing  is  implied. 

Affix  grammars,  by  contrast,  use  a notation  which  suggests  preoccupation 
with  passing  things  around,  one  step  after  another.  A little  bit  of 
this  is  useful  to  suggest  how  practical  implementations  could  be  achieved, 
but  to  insist  on  this  procedural  view  is  to  complicate  the  task  of 
writing  a grammar  by  raising  to  notice  just  those  details  which  it  is 
the  glory  of  a grammar  to  suppress.  Affix  grammars  can  be  viewed  as 
declarative,  with  a bit  of  effort,  but  once  they  are  viewed  in  that 
way  the  Knuth  approach  may  recommend  itself  as  more  natural. 

A first  orientation  to  attribute  grammars  should  include  a warning 
that (unlike affix grammars) attribute grammars are not (explicitly)
two-level  grammars,  were  not  motivated  by  the  syntax  of  Algol  68,  nor 
were  they  introduced  to  admit  of  efficient  parsing.  Instead,  attribute 
grammars  were  proposed  by  Knuth  as  a way  to  specify  the  "Semantics  of 
Context-Free Languages" (the title of Knuth 1968). The idea was to
associate "attributes" with the categories of a context-free grammar,
with values defined relative to the values of other attributes in the
same rule.

A sample grammar (for the same language as those of the preceding
sections, with which the rules may once again be compared one for one)
would be:

Attributes:

    name   inherited or   type of value            for category symbols
           synthesized
    NAME   synthesized    string                   name
    DSET   synthesized    set of names declared    list
    USET   inherited      set of names used        seq
    ERR    synthesized    Boolean (true or false)  name, list, seq, program
    SYNTAX                        SEMANTICS

    1. program → list seq end     1.1a  USET(seq) = DSET(list)
                                  1.1b  ERR(program) = ERR(list) .OR. ERR(seq)

    2. list → dcl name            2.1a  DSET(list) = NAME(name)
                                  2.1b  ERR(list) = ERR(name)
            | list dcl name       2.2a  DSET(list1) = DSET(list2) ∪ {NAME(name)}
                                  2.2b  ERR(list1) = ERR(list2) .OR.
                                          (if NAME(name) ∈ DSET(list2)
                                           then true else false)

    3. seq → λ                    3.1a  ERR(seq) = false
           | seq use name         3.2a  USET(seq2) = [...]
                                  3.2b  ERR(seq1) = [...]
                                          (if NAME(name) ∈ USET(seq1)
                                           then false else true)

    4. name → x | y | z           4.1a  NAME(name) = [...]
                                  4.1b  ERR(name) = [...]


(In semantic rules, digits following node names are used to distinguish
repeated instances of the same symbol: DSET(list1) is the DSET attribute
of the first occurrence of the category symbol 'list' in the rule, etc.)

There are two ways in which this attribute grammar is more explicit [...]
if the above convention on unwritten rules is adopted.) Clearly, other
conventions  about  errors  which  would  be  more  like  those  of  the  affix 
grammar  could  be  used. 
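
The evaluation of the attributes DSET, USET and ERR over a structural description can be sketched in Python as follows. The tuple representation of trees and the function names are hypothetical. The treatment of the rule for a sequence followed by a use is an assumption: it is taken here to pass USET down unchanged and to combine errors with .OR. on the pattern of the corresponding rule for lists.

```python
def eval_list(node):
    """Return (DSET, ERR) for a list node written as (sublist or None, name)."""
    sub, name = node
    if sub is None:                          # list -> dcl name
        return {name}, False
    dset2, err2 = eval_list(sub)             # list1 -> list2 dcl name
    return dset2 | {name}, err2 or (name in dset2)

def eval_seq(node, uset):
    """Return ERR for a seq node, given its inherited attribute USET."""
    if node is None:                         # seq -> lambda
        return False
    sub, name = node                         # seq1 -> seq2 use name
    err2 = eval_seq(sub, uset)               # assumption: USET(seq2) = USET(seq1)
    return err2 or (name not in uset)

def eval_program(list_node, seq_node):
    """Return ERR(program), with USET(seq) = DSET(list)."""
    dset, err_list = eval_list(list_node)
    err_seq = eval_seq(seq_node, dset)
    return err_list or err_seq

# dcl x dcl y use y use x end
good = eval_program(((None, "x"), "y"), ((None, "y"), "x"))
# dcl x dcl x use y end
bad = eval_program(((None, "x"), "x"), (None, "y"))
```

Note that an ill-formed string is not blocked, as in the affix grammar, but receives a structural description whose root carries the value true for ERR.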

Finally,  the  convention  of  calling  the  context-free  rules  the 
"syntactic  rules"  and  of  calling  the  conditions  on  attribute  values 
the  "semantic  rules"  is  purely  an  historical  artifact  and  new  nomenclature 
can  be  introduced.  The  same  distinction  has  been  so  long  used  in  natural 
language  systems,  however,  where  the  category  symbols  of  "the  syntax" 
are  augmented  with  "semantic  features"  and  tests,  that  there  is  not  much 
risk  of  confusion  in  the  present  context. 

5.4  An  Experiment  with  Three  Notations 

Although  there  is  no  question  of  the  utility  of  grammars  with 
structured  vocabulary  in  defining  syntax,  there  are  many  practical 
questions  about  how  to  go  about  writing  and  grasping  a grammar  large 
enough  to  be  a comprehensive  grammar  for  a natural  language.  Restricted 
van  Wijngaarden  grammars,  in  a sense,  only  work  with  "inherited" 
attributes;  all  information  lower  in  a structural  description  is  imposed 
by higher levels. Knuth (1968) points out that alternatively it is always
possible  to  use  only  "synthesized"  attributes;  the  entire  form  of  the 
tree  can  be  encoded  into  attributes  of  successively  higher  nodes,  and 
then  any  function  of  that  attribute  computed  at  the  root.  But  the 
claim  is  that  an  interplay  of  inherited  and  synthesized  definitions  is 
more  natural,  and  easier  for  people  to  think  about. 

It  is  by  no  means  clear  whether  or  not  this  is  true,  so  as  an  aid 
in  evaluating  the  claim  we  present  here  three  definitions  of  the  same 
language,  this  time  a language  more  closely  related  to  the  non-context- 
free  phenomena  of  natural  languages. 

Natural  language  examples  tend  to  be  very  large  relative  to  what 
they  reveal,  so  we  once  again  consider  a language  abstracted  from  natural 
language  studies  so  as  to  be  able  to  write  revealing  grammars  in  reasonable 
space.  This  sample  draws  on  material  familiar  to  linguists,  and  so  should 
give  the  flavor  of  the  enterprise.  Such  abstract  structures  as  are  used 
here are not envisioned as playing a part in any machine translation
system directly, although analogous problems arise in machine translation.

The  problem  is  that  of  specifying  the  set  of  derivation-initial 
phrase-markers  for  a grammar  of  the  sort  associated  with  "Generative 
Semantics".  In  such  a "semantically-based"  grammar,  one  needs  something 
like  a base  component  to  initiate  derivations;  as  McCawley  says,  "the 
closest  analogue  to  the  base  component  of  Aspects  is  a set  of  rules 
specifying what semantic representations are well-formed" (McCawley 1973).
Although  no  such  set  of  rules  has  ever  been  made  explicit,  in  general 
we  are  talking  about  structures  in  which  quantified  noun  phrases  originate 
in  higher  sentences,  and  in  which  clauses  of  the  ordinary  type  contain 
only  references  to  the  indexes  of  these  noun  phrases. 

A tiny  context-free  grammar  for  such  structures  could  be: 

    1. tops → # s #
    2. s → qp s
    3. qp → q binding-np
    4. s → pred arg arg
    5. arg → s
    6. arg → bound-np

We  will  imagine  that  lexical  terminals  appear  under  "q"  (quantifier), 
"pred"  (predicate),  "binding-np",  and  "bound-np"  by  some  auxiliary 
non-grammatical process. This is perhaps not the best grammar for such
structures, but it has the advantage of being very simple; still more
interesting, though longer, examples could be constructed around the
structures  of  (McCawley  1973,  pp.  79ff.). 

A sample derivation in this grammar might be (with lexical terminals
inserted):

[Tree diagram: derivation rooted in 'tops', with lexical terminals
including the indexed nouns 'men' and 'books']

which  could  possibly  underlie  "Few  men  read  many  books." 

But  the  context-free  base  rules  are  not  adequate  to  define  the 
structures  we  want,  since  they  could  equally  well  produce  the  trees  such 
as  the  following  (assuming  that  the  lexical  insertion  process  is  not 
smart, but merely randomly inserts lexical nouns):

[Tree diagram: an ill-formed structure whose innermost clause is
s → pred arg arg, with 'read' under pred and the bound-np's 'men' and
'books' under the two args]

This  tree  has  two  binding  occurrences  for  men^;  that  is  not  possible,  so 
the  outer  men^  would  be  a vacuous  quantifier.  Books^  appears  free  in 
this  expression,  but  it  should  be  bound;  it  lacks  a defining  quantifier 
expression.  Thus,  this  tree  could  not  underlie  any  well-formed  expression 
since  it  is  incoherently  formed. 

So  we  should  add  to  our  context-free  grammar  a requirement  that 
every  bound-np  must  be  the  same  as  one  and  only  one  binding-np  above  it, 
must be the same index as the lowest instance of the same name above it,
and there must be no excess binding-np's.

Since these linguistic trees were designed to be parallel in structure
to logical formulae, it is not surprising that these requirements are
the same as the requirements for "coherent quantification" in the
quantificational expressions used by logicians. But the fact is that such
languages are not context-free (have no context-free grammars). (This
fact as applied to logical formulae has been discussed by Janet Dean
Fodor 1970; McCawley has discussed the difficulty in specifying a base
grammar, but without giving this natural linguistic characterisation of
it.)

Obviously this logical language is related to the one discussed
in the preceding three sections, but it is more complicated in two ways.
First, the same "name" may be introduced several times with different
meanings, so long as each layer of binding and bound occurrences meets
all the tests, and so the syntax must sort out such multiple uses.
Second, vacuous binding occurrences are not permitted, so it is necessary
to impose the requirement that every name (every binding use of a name)
has a bound use somewhere else.

A van Wijngaarden grammar for such a language is extremely
straightforward to set down, although it is rather tricky to read and
understand. The important part is the first six hyper-rules
(corresponding to the six context-free rules), which have predicate
symbols that require that every sentence has a properly-formed table
consisting of "layers" of bindings which is consistent with the bindings
actually present, that a parallel table of uses also contain all
bindings, and that bound variables should be the only item used in their
tables and should be identified in the proper scope of the bindings table.

The characterisation of well-formed tables of names and binding
occurrences also turns out to be straightforward, if tedious, in purely
grammatical terms. A well-formed table is a set of pairs (TEXT, IDENT)
divided into layers by the punctuation mark "new". All names must be
distinct in every layer. The same name TEXT may occur in different layers,
but every unique name IDENT must be unique in the table. Observe that
the meta-rules generate all the possible TABLES there are for the
hyper-rules, and it is only the predicate tests which restrict the tables
to be well-formed.
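
Read outside the grammar, the well-formedness conditions amount to two
uniqueness tests. The following sketch assumes a table is represented as a
list of layers, each a list of (TEXT, IDENT) pairs; that representation and
the function names are illustrative assumptions, not part of the van
Wijngaarden formulation.

```python
def well_formed_layer(layer):
    """All names (TEXT) must be distinct within a single layer."""
    texts = [text for text, _ in layer]
    return len(set(texts)) == len(texts)


def well_formed_table(table):
    """Every layer is well-formed and every IDENT is unique in the table.

    `table` is a list of layers; each layer is a list of (TEXT, IDENT)
    pairs. The boundary between layers plays the role of "new".
    """
    idents = [ident for layer in table for _, ident in layer]
    return (all(well_formed_layer(layer) for layer in table)
            and len(set(idents)) == len(idents))


# The same TEXT may recur in different layers, with different IDENTs.
ok_table = [[("men", "i")], [("men", "ii"), ("books", "iii")]]
# The same IDENT may not recur anywhere in the table.
clash = [[("men", "i")], [("books", "i")]]
```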

Meta-rules:

m1.  ALPHA :: → a | b | c | ... | x | y | z
m2.  STRING :: → ALPHA | STRING ALPHA
m3.  EMPTY :: → λ
m4.  STRINGETY :: → STRING | EMPTY
m5.  LETTER :: → letter ALPHA
m6.  TEXT :: → LETTER | TEXT LETTER
m7.  NUMBER :: → i | ii | iii | ... | iiiiiiiii
m8.  DIGIT :: → digit NUMBER
m9.  IDENT :: → DIGIT | IDENT DIGIT
m10. DEF :: → ( TEXT , IDENT )
m11. DEFS :: → DEF | DEFS DEF
m12. DEFSETY :: → DEFS | EMPTY
m13. NEW :: → new
m14. TABLE :: → new DEFS | TABLE new DEFS
m15. BOUND :: → TABLE
m16. USED :: → TABLE
m17. ALPHABET :: → abcdefghijklmnopqrstuvwxyz

Basic Hyper-rules:

h1. <tops> : → <# symbol> <s BOUND USED> <# symbol>
        <where BOUND is NEW>
        <where USED differs from NEW>

h2. <s BOUND USED> : → <qp DEFS> <s BOUND NEW DEFS USED>
        <where BOUND NEW DEFS is a well-formed table>
        <where BOUND NEW DEFS is a subset of USED>

h3. <qp DEF> : → <q> <binding-np DEF>

h4. <s BOUND USED1> : → <pred> <arg BOUND USED2> <arg BOUND USED3>
        <where USED1 is union of USED2 and USED3>

h5. <arg BOUND USED> : → <s BOUND USED>

h6. <arg BOUND USED> : → <bound-np DEF>
        <where DEF is identified in BOUND>
        <where NEW DEF is USED>

Hyper-rules to expand predicate hyper-symbols:

h7.  <where TABLE new DEFSETY is a well-formed table> : →
         <where TABLE is a well-formed table>
         <where DEFS is a well-formed layer>
         <where DEFS does not confuse TABLE>

h8.  <where DEFSETY (TEXT, IDENT) is a well-formed layer> : →
         <where (TEXT, is not in DEFSETY>
         <where , IDENT) is not in DEFSETY>
         <where DEFSETY is a well-formed layer>
       | <where DEFSETY is EMPTY>

h9.  <where (TEXT1, is not in (TEXT2, IDENT) DEFSETY> : →
         <where TEXT1 differs from TEXT2>
         <where (TEXT1, is not in DEFSETY>
       | <where TEXT1 differs from TEXT2>
         <where DEFSETY is EMPTY>

h10. <where , IDENT1) is not in (TEXT, IDENT2) DEFSETY> : →
         <where IDENT1 differs from IDENT2>
         <where , IDENT1) is not in DEFSETY>
       | <where IDENT1 differs from IDENT2>
         <where DEFSETY is EMPTY>

h11. <where DEFS does not confuse TABLE new DEFSETY> : →
         <where DEFS does not confuse TABLE>
         <where DEFS does not confuse DEFSETY>

h12. <where DEFSETY (TEXT, IDENT) does not confuse DEFS> : →
         <where , IDENT) is not in DEFS>
         <where DEFSETY does not confuse DEFS>

h13. <where DEF is identified in TABLE new DEFSETY> : →
         <where DEF resides in DEFSETY>
       | <where DEF is independent of DEFSETY>
         <where DEF resides in TABLE>

h14. <where DEF1 resides in DEFS DEF2> : →
         <where DEF1 resides in DEFS>
       | <where DEF1 resides in DEF2>

h15. <where (TEXT1, IDENT1) resides in (TEXT2, IDENT2)> : →
         <where TEXT1 is TEXT2>
         <where IDENT1 is IDENT2>

h16. <where USED1 is union of USED2 and USED3> : →
         <where USED1 is setequal to USED2 USED3>

h17. <where USED is setequal to BOUND> : →
         <where USED is subset of BOUND>
         <where BOUND is subset of USED>

h18. <where TABLE1 new DEFSETY is subset of TABLE2> : →
         <where TABLE1 is subset of TABLE2>
         <where DEFSETY is subset of TABLE2>

h19. <where DEFSETY DEF is subset of TABLE> : →
         <where DEF is identified in TABLE>
         <where DEFSETY is subset of TABLE>

h20. <where EMPTY is subset of TABLE> : →
         <EMPTY>

h21. <where STRINGETY1 ALPHA1 differs from STRINGETY2 ALPHA2> : →
         <where STRINGETY1 differs from STRINGETY2>
       | <where ALPHA1 precedes ALPHA2 in ALPHABET>
       | <where ALPHA2 precedes ALPHA1 in ALPHABET>

h22. <where STRING differs from EMPTY> : →
         <EMPTY>

h23. <where EMPTY differs from STRING> : →
         <EMPTY>

h24. <where ALPHA1 precedes ALPHA2 in STRINGETY1 ALPHA1
         STRINGETY2 ALPHA2 STRINGETY3> : →
         <EMPTY>

h25. <where STRINGETY is STRINGETY> : →
         <EMPTY>

An affix grammar for the same language is considerably more detailed
(in the part corresponding to the first six hyper-rules, naturally, since
there is nothing to correspond to the rest of them). It is no longer
possible to just take the lofty view that we did in the van Wijngaarden
grammar that every binding variable must be used and every variable used
must be bound; now we must devise an explicit arrangement whereby a list
of binding occurrences is passed down, while both lists of binders and
uses are passed up to be compared at the root.

Affix grammar rules:

1. <tops> : → <# symbol> <s ↓BOUND ↑USED ↑INTR> <# symbol>
       <assignset({}) returns (BOUND)>
       <checkuse(INTR, USED)>

2. <s ↓BOUND1 ↑USED ↑INTR1> : → <qp ↑INTR2 ↑BDING>
       <s ↓BOUND3 ↑USED ↑INTR3>
       <assignset(INTR2 ∪ INTR3) returns (INTR1)>
       <overridepairs(BOUND1, BDING) returns (BOUND3)>

3. <qp ↑INTR ↑BDING> : → <q> <binding-np ↑TEXT ↑IDENT>
       <assignset({IDENT}) returns (INTR)>
       <overridepairs((TEXT, IDENT), {}) returns (BDING)>

4. <s ↓BOUND ↑USED1 ↑INTR1> : → <pred> <arg ↓BOUND ↑USED2 ↑INTR2>
       <arg ↓BOUND ↑USED3 ↑INTR3>
       <assignset(USED2 ∪ USED3) returns (USED1)>
       <assignset(INTR2 ∪ INTR3) returns (INTR1)>

5. <arg ↓BOUND ↑USED ↑INTR> : → <s ↓BOUND ↑USED ↑INTR>

6. <arg ↓BOUND ↑USED ↑INTR> : → <bound-np ↑TEXT>
       <assignset({}) returns (INTR)>
       <identify(TEXT, BOUND) returns (USED)>

Predicates:

name           input          output        function
               parameters     parameters

assignset      1. SET1        2. SET2       SET2 := SET1

checkuse       1. SET1        -             if (SET1 \ SET2) = {}
               2. SET2                      then succeed
                                            else fail
                                            fi

overridepairs  1. PAIRS1      3. PAIRS3     PAIRS3 := (PAIRS1
               2. PAIRS2                    overridden by
                                            PAIRS2)

identify       1. TEXT        3. IDENT      if (TEXT is first member
               2. PAIRS                     of a pair in PAIRS)
                                            then IDENT := (corresponding
                                            second member)
                                            else fail
                                            fi
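
For concreteness, the four predicates might be rendered as ordinary
functions along the following lines. This is a sketch under stated
assumptions: sets of idents are Python sets, PAIRS is a dictionary from
TEXT to IDENT, and failure is signalled by False or None. None of these
choices is dictated by the affix grammar itself.

```python
def assignset(set1):
    """SET2 := SET1; the output is a copy of the input set."""
    return set(set1)


def checkuse(set1, set2):
    """Succeed (True) if SET1 \\ SET2 is empty, otherwise fail (False)."""
    return not (set(set1) - set(set2))


def overridepairs(pairs1, pairs2):
    """PAIRS3 := PAIRS1 overridden by PAIRS2 (later pairs win)."""
    pairs3 = dict(pairs1)
    pairs3.update(pairs2)
    return pairs3


def identify(text, pairs):
    """Return the IDENT paired with TEXT, or None to signal failure."""
    return pairs.get(text)


# Rule 2 in miniature: an inner binding of "men" overrides the outer one.
bound = overridepairs({"men": "i"}, {"men": "ii", "books": "iii"})
```

With `bound` as above, `identify("men", bound)` yields the inner ident, and
`checkuse(introduced, used)` at the root is what rejects vacuous bindings.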


And finally, an attribute grammar will be much like an affix
grammar in its strategy.

Attributes:

Inherited

name      for symbols    type of value    comment

↓BOUND    s, arg         function         specifies unique ident
                                          for any text, relative
                                          to context

Synthesized

name      for symbols    type of value    comment

↑UNUSED   tops           set of idents    vacuous bindings

↑USED     s, arg         set of idents    binding variables used
                                          in this subtree

↑INTR     s, arg, qp     set of idents    binding variables
                                          introduced in this subtree

↑BDING    qp             function         specifies ident of a
                                          single binding text

↑TEXT     binding-np,    string           text of variable
          bound-np                        name

↑IDENT    binding-np     unique number    identification of
                         in tree          binding np

Convention:

A tree is well-formed only if the value of ↑UNUSED at the root
is the null set.

Attribute grammar rules:

1. tops → # s #
       ↑UNUSED(tops) = ↑INTR(s) \ ↑USED(s)
       ↓BOUND(s) = ∅

2. s1 → qp s2
       ↑USED(s1) = ↑USED(s2)
       ↑INTR(s1) = ↑INTR(s2) ∪ ↑INTR(qp)
       ↓BOUND(s2) = ↓BOUND(s1) overridden by ↑BDING(qp)

3. qp → q binding-np
       ↑INTR(qp) = ↑IDENT(binding-np)
       ↑BDING(qp) = {↑TEXT(binding-np), ↑IDENT(binding-np)}

4. s → pred arg1 arg2
       ↑USED(s) = ↑USED(arg1) ∪ ↑USED(arg2)
       ↑INTR(s) = ↑INTR(arg1) ∪ ↑INTR(arg2)
       ↓BOUND(arg1) = ↓BOUND(s)
       ↓BOUND(arg2) = ↓BOUND(s)

5. arg → s
       ↑USED(arg) = ↑USED(s)
       ↑INTR(arg) = ↑INTR(s)
       ↓BOUND(s) = ↓BOUND(arg)

6. arg → bound-np
       ↑USED(arg) = ↓BOUND(arg)(↑TEXT(bound-np))
       ↑INTR(arg) = ∅

A comparison of these three definitions is somewhat difficult,
all the more so because there are choices of style made in connection
with each which are arbitrary and which could be changed, but it does
seem that the van Wijngaarden grammar may offer "too smooth a surface,"
and conceal too much behind its meta-symbols and the predicates to
restrict their productions. The greater length of the van Wijngaarden
grammar should not be held against it; quite the contrary, since the
excess comes entirely in the expansion of the predicate hyper-symbols,
in which every jot and tittle of convention is formalized completely
in syntactic terms, whereas the other definitions both rely on many
"understood" notions from mathematics and programming languages. However,
it seems hard to avoid some feeling of a Turing machine simulation in the
strings of the production symbols in a van Wijngaarden grammar, and it
is probably helpful to relax a little as the others do and accept sets
and trees (say) as primitives, with operations on them directly.

The affix grammar seems to conceal just the wrong part of the
grammar in its separately-defined predicates. The long lists of (mostly
redundant) affixes are written out in full, but the actions are hidden
in separate procedures which define relations among the affixes. The
reverse arrangement of Knuth's attribute grammars seems preferable;
and it is possible to read (and perhaps necessary to write) the attribute
grammar in a purely "declarative" frame of mind, treating the semantic
rules as static conditions on the well-formedness of feature sets.
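
Read operationally, the attribute rules for this small language describe a
single recursive walk: BOUND is passed down, USED and INTR are passed up,
and UNUSED is computed at the root. The sketch below is one possible
rendering under assumptions of our own: trees are nested tuples, rule 3 is
folded into rule 2, the copy rule 5 is folded into rule 4, and an unbound
name raises KeyError (the rules above leave that case undefined).

```python
def evaluate(node, bound):
    """Return (USED, INTR) for an s or arg node, given inherited BOUND.

    Hypothetical tree encoding:
      ("quant", text, ident, body)   s1 -> qp s2   (rules 2 and 3)
      ("pred", name, arg1, arg2)     s -> pred arg1 arg2   (rules 4 and 5)
      ("var", text)                  arg -> bound-np   (rule 6)
    """
    kind = node[0]
    if kind == "quant":
        _, text, ident, body = node
        bding = {text: ident}                      # BDING(qp)
        # BOUND(s2) = BOUND(s1) overridden by BDING(qp)
        used, intr = evaluate(body, {**bound, **bding})
        return used, intr | {ident}                # INTR(s1) adds INTR(qp)
    if kind == "pred":
        _, _name, arg1, arg2 = node
        used1, intr1 = evaluate(arg1, bound)       # BOUND is copied down
        used2, intr2 = evaluate(arg2, bound)
        return used1 | used2, intr1 | intr2
    if kind == "var":
        _, text = node
        return {bound[text]}, set()                # KeyError if unbound
    raise ValueError(f"unknown node kind: {kind}")


def unused(tree):
    """UNUSED(tops) = INTR(s) \\ USED(s), starting from an empty BOUND."""
    used, intr = evaluate(tree, {})
    return intr - used


good = ("quant", "men", 1,
        ("quant", "books", 2,
         ("pred", "read", ("var", "men"), ("var", "books"))))
shadowed = ("quant", "men", 1,
            ("quant", "men", 2,
             ("pred", "read", ("var", "men"), ("var", "men"))))
```

By the convention stated with the attributes, a tree is well-formed only
when `unused(tree)` is the empty set; for `shadowed` the outer binding is
reported as vacuous.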

A practical notation for van Wijngaarden grammars, based on this
very limited experiment, might be to write rules with conditions in
Knuth's form. It would doubtless in this case be wise to add some
analogue of the Uniform Replacement Convention, as we discussed before.
Observe, for instance, how rule 5 in the van Wijngaarden grammar and in
the affix grammar are exquisitely simple, whereas rule 5 in the attribute
grammar contains (predictable) conditions to "pass along" every attribute
of the symbols in the rule.

5.5  The  Parsing  Problem  for  Restricted  Grammars  with  Structured  Vocabulary 

The parsing problem for general van Wijngaarden grammars is the
problem of parsing type-0 languages, but for restricted languages such
as we have explored in the last section, the basic parsing strategy is
clear: these are just context-free grammars plus "a little more." As
a result, it is not difficult to see how to proceed.

For basic context-free parsing, there is one algorithm of greatest
importance, or one family of algorithms: this is the "nodal spans"
algorithm of John Cocke (Cocke and Schwartz 1970) and its extension
with the predictive elements of Knuth's LR(k) technique (Knuth 1965)
by Jay Earley (Earley 1968, 1970). This "chart parser" technique (Benson
1969, Woods 1975) is the most flexible and general of the parsing
algorithms, with excellent speed when implemented correctly. This
algorithm has been employed in the Quince system at Berkeley, and in its
predecessor Syntax Analysis System, since 1967, perhaps uniquely, since a
recent survey of the field (Grishman 1976) remarks that these "algorithms
have, to the best of our knowledge, not yet been used in natural language
parsing."
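
As an illustration of the chart technique, the following is a minimal
Earley-style recognizer applied to the six context-free rules of the
preceding section. It is a textbook-style sketch, not the Quince parser:
it assumes a grammar without empty productions, treats the category names
"q", "pred", "binding-np", "bound-np" and "#" as terminals, and only
recognizes (it builds no trees).

```python
GRAMMAR = {
    "tops": [["#", "s", "#"]],
    "s": [["qp", "s"], ["pred", "arg", "arg"]],
    "qp": [["q", "binding-np"]],
    "arg": [["s"], ["bound-np"]],
}


def recognize(tokens, grammar=GRAMMAR, start="tops"):
    """Earley recognizer; items are (lhs, rhs, dot, origin).

    Assumes the grammar has no empty right-hand sides.
    """
    chart = [set() for _ in range(len(tokens) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, tuple(rhs), 0, 0))
    for i in range(len(tokens) + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:                        # predictor
                    new = [(sym, tuple(r), 0, i) for r in grammar[sym]]
                    target = i
                elif i < len(tokens) and tokens[i] == sym:   # scanner
                    new = [(lhs, rhs, dot + 1, origin)]
                    target = i + 1
                else:
                    continue
            else:                                         # completer
                new = [(l, r, d + 1, o) for (l, r, d, o) in chart[origin]
                       if d < len(r) and r[d] == lhs]
                target = i
            for item in new:
                if item not in chart[target]:
                    chart[target].add(item)
                    if target == i:
                        agenda.append(item)
    return any(l == start and d == len(r) and o == 0
               for (l, r, d, o) in chart[len(tokens)])


# Category string for a structure such as "Few men read many books."
sentence = ["#", "q", "binding-np", "q", "binding-np",
            "pred", "bound-np", "bound-np", "#"]
```

The recognizer accepts `sentence` but, being purely context-free, it would
equally accept the incoherently quantified trees discussed earlier; that
is the "little more" the features must supply.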

The  only  remaining  question,  then,  is  how  to  handle  the  "little 
more"  of  the  features. 

If one is given the restrictions customarily placed on affix
grammars (not detailed here) (Watt 1977), then it is possible to check
all affixes in a single pass over a parse tree, and moreover this can
be interleaved with parsing itself in a deterministic parsing algorithm
(such as LR(k) or LL(k)). It is not clear, however, that a grammar for
any natural language could be written conforming to these restrictions.

For completely unrestricted attribute grammars, Fang (1972) wrote
a non-deterministic parsing system, of unsatisfactory efficiency. For
restricted attribute grammars of the sort considered here, however,
it is not clear that such generality is needed, either.

The best choice seems to be the procedure of (Bochmann 1976), in
which a number of left-to-right passes over a parse tree are used to
evaluate all attributes. In practical cases it appears that a very few
passes would be sufficient, unless the definition of attributes is
circular, because the depth of nesting in grammars for natural languages
is not great. (Bochmann includes an algorithm for determining the
maximum number of passes necessary for an attribute grammar, and such
algorithms can be of practical use with restricted grammars, in spite
of the demonstration (Jazayeri, Ogden, and Rounds 1975) of the
intrinsically exponential complexity of the circularity problem for
attribute grammars.)

The problem is well-understood in the case of attribute grammars
applied to programming languages (see, e.g., Lewis, Rosenkrantz, and
Stearns 1976). A chief difference between these grammars and grammars
used for natural languages is that programming language grammars are
typically unambiguous (or nearly so) even without considering attributes.
The ambiguity of natural language grammars without considering attributes
is intended to be high (that is part of what the attributes are for),
which suggests that as much as possible of the attribute-processing
should be interfactored with parsing to eliminate false partial trees
as early as possible. This requirement could well mean that a
determination should be made by a grammar pre-processor as to which
attributes have the "one-pass" property, and which must be computed over
completed derivation trees, with different strategies used for the two
kinds of attributes. This determination applied to an attribute grammar
would be straightforward as compared to a compiler for Woods's ATNs
(Burton and Woods 1976) because of the more regular structure of the
attribute grammar.

At the level of implementation tactics, as opposed to strategy,
there are a number of challenging problems in using grammars with
structured vocabulary. Some of the techniques have been worked out in
connection with the current Quince parser in use at Berkeley, and
references to such of this work as has been published will be found in
the final section.

6. The Quince System and Grammars with Structured Vocabulary

It would be more surprising than not if grammars with structured
vocabulary played no part in the existing Quince parsing approach and
in our plans for future progress; this is especially so in light of
the argument made in the preceding sections that structured vocabulary
is important (however disguised) in the grammars or procedures of all
current natural language systems. We have, however, been perhaps more
systematic than most research groups in our past development of this
topic.

6.1 Previous Uses of Structured Vocabulary at POLA

How to utilize a grammar of Chinese with structured vocabulary
has been a topic of research at the Project on Linguistic Analysis since
at least 1970. Prior to that time machine translation research at
Berkeley (going back to Sidney Lamb's work on Russian in the early
1960's) had made use of grammars with symbols which were systematically
related in the minds of the grammarians, but this relatedness was not
exploited in the computer systems. By the early 1970's, a succession
of grammar-writers had introduced several different and in part
incompatible systems of structuring the vocabulary of the grammars.

Accordingly, research was begun on how a set of features should
be used with the existing grammar of Chinese. This eventually led to
a project (described at length in Wang and Chan 1974) to write a "core
grammar" of basic syntactic categories, augmented with a set of features.
In support of this project a reclassification of all syntactic categories
was undertaken, and a program was written which translated between the
basic category symbols plus features, and a second "extended" set of
category symbols which included the feature information.

Thus, since about 1972 Chinese grammars at POLA have been maintained
using indexes of "core grammar rules" followed by their feature
instantiations, so that a grammarian did not need to understand over 3000
rules directly.

Studies of features and simplifications of the grammar continued,
and (Wang and Chan 1975) reports a wide variety of Chinese examples
with the features needed for correct syntactic analyses in terms of the
basic categories. For example, the feature "abstractness" is found to
be essential for analysing copula sentences. The subject noun phrase
must agree with the object noun phrase with respect to this feature in
a copula sentence. An actual example from our Physics texts is extracted
below for illustration. Two syntactic analyses are possible for this
sentence, given as (a) and (b) below, but only (b) gives the correct
analysis of the structure of the sentence.


example: from Physics T-2, 1972-10, p. 61, 5th parag.
[Chinese text and tree diagrams lost in reproduction; the particle symbol
is illegible in both glosses.]

Analysis (a):

'below is discussed [?]-particle is the possibility of an electron'

Analysis (b):

'below is discussed the possibility of [?]-particle being an electron'
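
The agreement test at work in this example can be sketched as a check that
two feature sets assign no feature opposite values. The sketch is
illustrative only: the dictionary representation, the function name
`nonconflicts` (borrowed from the predicate used in the Appendix), and the
feature values given to the three nouns are assumptions made for the
example, not entries from the POLA dictionary.

```python
def nonconflicts(features1, features2):
    """True if no feature is given opposite values by the two sets.

    Feature sets are dicts mapping a feature name to True (+) or
    False (-); a feature missing from either set is unspecified and
    cannot conflict.
    """
    return all(features2[name] == value
               for name, value in features1.items()
               if name in features2)


# Hypothetical feature assignments for the nouns in the example.
particle = {"noun": True, "abstract": False}
possibility = {"noun": True, "abstract": True}
electron = {"noun": True, "abstract": False}

# Analysis (a) makes "particle" the subject and "possibility" the object
# of the copula; analysis (b) pairs "particle" with "electron".
analysis_a_ok = nonconflicts(particle, possibility)
analysis_b_ok = nonconflicts(particle, electron)
```

Under these assumed values the copula agreement test rejects analysis (a)
and admits analysis (b).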


A mechanism for "feature parsing" was designed into the Quince
parsing system, and provisions for it were made in the initial
implementation of the parser. The practical problem of giving feature
encodings to a complete Chinese dictionary was intractable with the staff
available, and so a smaller working dictionary, drawn from selected
physics texts, was constructed for feature encoding.

At  the  same  time,  other  distinct  changes  in  the  grammar  and  in  the 
method  of  applying  "interlingual  transformations"  were  made  for  the 
Quince  system,  so  that  a version  of  Knuth's  attribute  grammars  could  be 
employed,  with  the  interlingual  specification  of  structural  change  carried 
as  attributes  imposed  by  the  Chinese  grammar  rules. 

We  did  not  then  recognize  the  close  connections  between  these  two 
topics,  although  the  fact  that  they  were  being  worked  on  by  the  same 
staff  members  should  have  assured  that  they  would  converge  eventually, 
had  development  not  come  to  a temporary  halt  shortly  thereafter. 

6.2  Research  Areas  for  Future  Study 

All  work  to  date  indicates  that  the  primary  research  problem  in  the 
area  of  parsing  for  Chinese-English  machine  translation  lies  in  how  to 
define/describe  natural  languages  in  general,  and  how  to  define/describe 
Chinese  and  English  in  particular.  This  linguistic  analysis  is  the 
really  difficult  task,  compared  to  which  the  implementation  of  programs 
to  carry  out  the  analysis  is  straightforward. 

Therefore,  the  next  task  is  to  make  some  trials  of  recasting 
our  Chinese  grammar  into  various  notations  for  grammars  with  structured 
vocabulary,  to  see  what  format  appears  to  give  maximal  insight  in  use. 

At present, based upon the earlier experiences described above, it seems
that a grammar notation based on Knuth's attribute grammars is the most
promising. Without attempting to formulate a realistically-large fragment
of a grammar for a natural language, there is no way to be sure about
some of the minor (though perhaps crucial) details, because prior
systematic uses of these formalisms have been in connection with
programming languages.

The strong tradition of using structured vocabulary, however
informally, in prior linguistic description and in computational projects,
makes us quite certain that systematic attention to creating a grammar
in this general form would constitute an advance over previous work.

There  are  two  specific  areas  of  uncertainty  which  we  wish  to  begin 
immediately  to  clear  up: 

(1) What is the trading relation between encoding information as
category symbols, versus features? Clearly, any grammar with a finite
number of rules (any such hyper-grammar) can be encoded directly by
multiplying out category symbols and rules (no matter that this may be
wildly impractical). At the other extreme, one can imagine a grammar
containing a single category symbol ("NODE") and one rule for each
length of string, combining a string of such symbols into a single
symbol, with all the information carried in attributes. Linguistic
tradition suggests an intuitive division of grammatical information in
the two classes, but can a clearer description be formulated?

(2) We have not attempted to exploit the "internal structure"
given by the meta-grammar of a van Wijngaarden grammar, and in passing
to an attribute representation one passes naturally to sets, functions,
etc. as attributes. But is there a way to exploit the fact that a tree
is given by a meta-grammar for every meta-symbol replaced, without getting
confused by the tree-manipulation systems? And if so, would it be
advantageous to restrict attributes to being tree-structured (ignoring
as much structure as desired in particular cases)?
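
The "multiplying out" mentioned under (1) can be made concrete with a small
sketch. It expands one feature-annotated rule into plain context-free
rules by substituting every combination of feature values, using the same
value for every occurrence of a feature name in the rule (a uniform
replacement). The rule, the feature names and their value sets below are
invented for illustration and are not taken from the Chinese grammar.

```python
from itertools import product


def multiply_out(lhs, rhs, domains):
    """Expand one rule over feature variables into plain CF rules.

    A symbol is a tuple (category, feature_name, ...); `domains` maps
    each feature name to its list of possible values.
    """
    names = sorted({f for sym in [lhs, *rhs] for f in sym[1:]})
    rules = []
    for values in product(*(domains[n] for n in names)):
        env = dict(zip(names, values))

        def plain(sym):
            # fuse the category and its feature values into one symbol
            return "_".join([sym[0], *(env[f] for f in sym[1:])])

        rules.append((plain(lhs), [plain(s) for s in rhs]))
    return rules


# Hypothetical rule: <rs TENSE> --> <np2> <vp2 TENSE INTRAN>
expanded = multiply_out(
    ("rs", "TENSE"),
    [("np2",), ("vp2", "TENSE", "INTRAN")],
    {"TENSE": ["past", "pres", "fut"], "INTRAN": ["+intran", "-intran"]},
)
```

Here one rule over two features yields six context-free rules; the count
is the product of the sizes of the value sets, which is why the fully
multiplied-out encoding quickly becomes impractical.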

Doubtless  other  similar  questions  will  suggest  themselves  as  work 
continues. 

While such research goes forward on the grammatical side, there
are also many questions to explore about a computer parsing procedure for
such grammars. The fact that such good strategies exist for parsing the
restricted affix grammars suggests that a related procedure could be
devised for restricted attribute grammars which would work similarly
(interfactoring parsing and attribute value calculation) wherever
possible, and only do more work (in the form of post-parse processing of
attributes in one or more additional passes over the tree) when necessary.
This could be of importance in making the full exploitation of such
grammars possible, although the techniques reported by us in the past
would themselves certainly be adequate to permit the most important uses
of grammars with structured vocabulary to be incorporated directly into
the Quince system.

APPENDIX: RULES FOR A FRAGMENT OF CHINESE GRAMMAR

0. <env TF> : --> <rrs TF CF>

1. <rrs TF CF> : --> (<aadv>) (<time TF>) <rs TF CF>

CF :: --> INTER PASS NEG IMPER
TF :: --> TENSE ASPECT
TENSE :: --> PAST | PRES | FUT
PAST :: --> +past -pres -fut
PRES :: --> -past +pres -fut
FUT :: --> -past -pres +fut
ASPECT :: --> PROG PERF
PROG :: --> +prog | -prog | EMPTY
PERF :: --> +perf | -perf | EMPTY

Notes:

env = environment
TF = Time Features
rrs = root root sentence
CF = Clause Features
aadv = sentence adverbials
rs = root sentence
INTER = interrogative
PASS = passive
NEG = negative
IMPER = imperative
PRES = present
FUT = future
PROG = progressive
PERF = perfective

2. <rs TF CF> : --> <np2 CF SUBJ> <vp2 CF TF VF VIRSUBJ VIROBJ>
       <where VF contain +intran>
       <where SUBJ nonconflicts VIRSUBJ>

   <rs TF CF> : --> <vp2 CF TF VF VIRSUBJ VIROBJ>
       <where CF contain +imper>

   <rs TF CF> : --> <np2 CF SUBJ> <vp2 CF TF VF VIRSUBJ VIROBJ>
       <np2 CF OBJ>
       <where VF contain -intran>
       <where SUBJ nonconflicts VIRSUBJ>
       <where OBJ nonconflicts VIROBJ>

   <rs TF CF> : --> <np2 CF SUBJ> <vp2 CF TF VF VIRSUBJ VIROBJ>
       <np2 CF OBJ>
       <where VF contain +copula>
       <where SUBJ nonconflicts OBJ>

   <rs TF CF> : --> <np2 CF SUBJ> <np2 CF OBJ> <vp2 CF TF VF
       VIRSUBJ VIROBJ>
       <where VF contain -intran>
       <where SUBJ nonconflicts VIRSUBJ>
       <where OBJ nonconflicts VIROBJ>

   <rs TF CF> : --> <np2 CF OBJ> <np2 CF SUBJ> <vp2 CF TF VF
       VIRSUBJ VIROBJ>
       <where VF contain -intran>
       <where OBJ nonconflicts VIROBJ>
       <where SUBJ nonconflicts VIRSUBJ>

   <rs TF CF> : --> <np2 CF OBJ> <vp2 CF TF VF VIRSUBJ VIROBJ>
       <where VF contain -intran>
       <where OBJ nonconflicts VIROBJ>

   <rs TF CF> : --> <vp2 CF TF VF VIRSUBJ VIROBJ> <np2 CF OBJ>
       <where VF contain -intran>
       <where OBJ nonconflicts VIROBJ>

   <rs TF CF> : --> <vp2 CF TF VF VIRSUBJ VIROBJ> <np2 CF OBJ>
       <where VF contain +exist>
       <where OBJ nonconflicts VIROBJ>


SUBJ :: --> NF
VIRSUBJ :: --> NF
OBJ :: --> NF
VIROBJ :: --> NF
NF :: --> +noun UCOMMON
UCOMMON :: --> +common UABSTRACT | -abstract UANIMATE
UABSTRACT :: --> +abstract UMOBILE | -abstract UANIMATE
UANIMATE :: --> +animate UHUMAN | -animate UBIOTIC
UBIOTIC :: --> +biotic | -biotic UMOBILE
UMOBILE :: --> +mobile | -mobile
UHUMAN :: --> +human | -human
VF :: --> INTRAN AGENT ERG REFLEX

   <rs TF CF> : --> <np2 CF SUBJ> <vp2 CF TF VF VIRSUBJ VIROBJ>
       <np2 CF OBJ1> <np2 CF OBJ2>
       <where VF contain -intran>
       <where OBJ1 contain +human>
       <where SUBJ nonconflicts VIRSUBJ>
       <where OBJ nonconflicts VIROBJ>

   <rs TF CF> : --> <np2 CF OBJ2> <np2 CF SUBJ> <vp2 CF TF VF
       VIRSUBJ VIROBJ> <np2 CF OBJ1>
       <where VF contain -intran>
       <where OBJ1 contain +human>
       <where OBJ nonconflicts VIROBJ>
       <where SUBJ nonconflicts VIRSUBJ>

   <rs TF CF> : --> <np2 CF OBJ1> <np2 CF SUBJ> <vp2 CF TF VF
       VIRSUBJ VIROBJ> <np2 CF OBJ2>
       <where OBJ1 contain +human>
       <where VF contain -intran>
       <where OBJ nonconflicts VIROBJ>
       <where SUBJ nonconflicts VIRSUBJ>

Notes:

VF = Verb Features
VIRSUBJ = virtual subject
VIROBJ = virtual object
NF = Noun Features
INTRAN = intransitive
ERG = ergative
REFLEX = reflexive
AUX = auxiliary
EXIST = existential

Rule 2 includes the following types of sentences:

    Subject  Verb (intransitive)
    Verb (imperative sentence)
    Subject  Verb (nonintransitive)  Object
    Subject  Verb (copula)  Object
    Subject  Object  Verb (nonintransitive)
    Object  Subject  Verb (nonintransitive)
    Object  Verb (nonintransitive)
    Verb (nonintransitive)  Object
    Verb (existential)  Object
    Subject  Verb (nonintransitive)  Object (indirect)  Object (direct)
    Object (direct)  Subject  Verb (nonintransitive)  Object (indirect)
    Object (indirect)  Subject  Verb (nonintransitive)  Object (direct)


3.  <vp2 CF TF VF VIRSUBJ VIROBJ> ::--> (<pp PF>) (<adv>) <nep>
            <vp1 CF TF VF VIRSUBJ VIROBJ>
            <where CF contain +neg>
            <where PF nonconflict VF>

    <vp2 CF TF VF VIRSUBJ VIROBJ> ::--> (<pp PF>) (<adv>)
            <vp1 CF TF VF VIRSUBJ VIROBJ>
            <where CF contain -neg>
            <where PF nonconflict VF>

Notes:

    pp  = prepositional phrase
    PF  = Prepositional Features
    adv = adverbial
    nep = negative

4.  <np2 CF1 SUBJ> ::--> <s TF2 CF2 SUBJ> | <np1 CF1 SUBJ> |
            (<det>) <rel CF1 SUBJ> <de> |
            (<det>) <rel CF1 HEAD> <de> <np1 CF1 HEAD SUBJ>

    <np2 CF1 OBJ> ::--> <s TF2 CF2 OBJ> | <np1 CF1 OBJ> |
            (<det>) <rel CF1 OBJ> <de> |
            (<det>) <rel CF1 HEAD> <de> <np1 CF1 HEAD OBJ>

    HEAD ::--> NF

Notes:

    s   = sentence
    det = determiner
    rel = relative clause
    de  = relative clause marker (terminal symbol)

5.  <rel CF1 SUBJ> ::--> <s CF1 SUBJ TF2 CF2> | <np1 CF1 SUBJ> |
            <pp CF1 SUBJ>

    <rel CF1 OBJ> ::--> <s CF1 OBJ TF2 CF2> | <np1 CF1 OBJ> |
            <pp CF1 OBJ>

    <rel CF1 HEAD> ::--> <s CF1 HEAD TF2 CF2> | <np1 CF1 HEAD> |
            <pp CF1 HEAD>




IV.  FURTHER  CONSOLIDATION  OF  THE  LINGUISTIC  DATA  BASE: 

LEXICAL  FEATURES  AND  INTERLINGUAL  TRANSFER  RULES 

1. Introduction

During the past few years, numerous improvements to the old SAS system
have been made or conceived. To make it possible to incorporate these
conceived improvements, a new system, the Quince system, has been emerging.
In this chapter, some previously conceived improvements to two areas of the
linguistic data base, the lexicon and the interlingual transfer rules, will
be elaborated in the light of recent developments in linguistic, artificial
intelligence, and computational linguistic theories.

The discussion of the lexicon will focus on lexical features, whose
implementation will be the most important single improvement to the
lexicon. The preceding chapter has already provided us with a conceptual
framework in which feature-handling mechanisms can feasibly be implemented.
The nature and functions of these lexical features and their relationships
to the other components of the linguistic data base will be described. The
choice of the lexical features and the types of lexical information for the
lexical entries in the lexicon will also be touched upon.

The interlingual transfer operation was conceived as an independent
phase in the translation cycle. The interlingual transfer rules were
regarded as belonging to an independent component of the linguistic data
base. In this chapter, the actual separation of the interlingual transfer
rules from the analysis rules will be emphasized once more. The nature,
functions, and different types of the interlingual transfer rules and a
possible way of implementing them will be discussed. To further improve the
interlingual transfer component, contrastive lexical and syntactic studies
and contextual analysis will also be outlined as part of future endeavors.

2. Lexical Features


The lexical features as defined here include the following types:
semantic, syntactic or morphological, contextual, and rule features. Each
lexical entry in the lexicon may contain all or part of these feature types.
Semantic features refer to what Katz (1972) called semantic markers. Examples
are [Human], [Object], [Animate], etc. Syntactic or morphological features
are those obligatory grammatical distinctions which a language imposes on its
surface representation. Examples in English are gender distinction in the
third person singular pronoun, singularity or plurality of countable nouns,
etc. Contextual features refer to those contexts in which a lexical item may
occur. For example, the contextual feature for an English noun is [+Det ___]
and that for an English transitive verb is [+ ___ NP]. The rule features are
those which indicate which particular interlingual transfer rule(s) a
particular lexical item will or will not participate in.

2.1  Nature  of  Semantic  Features 

The  names  "semantic  features",  "semantic  markers",  and  "semantic 
primitives"  are  roughly  equivalent  terms  used  by  different  researchers  in 
different  disciplines.  They  are  theoretical  constructs  intended  to  represent 
basic  conceptual  units  or  general  sense-components.  Since  there  is  no  unique 
way  of  breaking  down  the  universe  into  the  basic  conceptual  units,  different 
researchers  have  different  lists  of  those  units.  For  example,  Wilks  (1973d) 
gave  a list  of  sixty  semantic  primitives  while  Wierzbicka  (1972)  listed  only 
fourteen.  Based  on  our  conception  of  the  functions  of  semantic  features  as 
stated  below,  we  are  not  providing  an  exhaustive  list  of  them  adequate  for 
the  analysis  of  the  vocabulary  of  the  Chinese  language.  Only  those  semantic 
features  which  will  best  account  for  our  data  and  serve  our  practical  ends 
will  enter  our  feature  list. 

2.2  Functions  of  Semantic  Features 

Recent developments in semantic primitives or semantic features seem to
have started in componential analysis in anthropology. In anthropological
componential analysis, semantic features are intended to be the building
blocks of lexical fields. In some well-defined lexical fields, componential
analysis has been successful, but it cannot be rigorously applied with the
same degree of success to the many fields which are not well defined.

In linguistic semantics, semantic features are used mainly to indicate
meaning relations among the lexical entries of the lexicon. They are also
used to explain semantic anomaly in terms of feature incompatibility within
a constituent.

In artificial intelligence, semantic primitives and relations are the
building blocks of semantic networks, which represent meanings or
conceptualisations of words or sentences in a language.

Within the conceptual framework of our feature grammar outlined in the
previous chapter, the primary function of semantic features in the rules and
the lexicon is to provide an elegant means of capturing general conditions on
syntagmatic collocation or co-occurrence restriction. Those conditions on
co-occurrence restriction are to be used as well-formedness conditions, to
help disambiguate sentences and to throw out semantically anomalous
interpretations.

In our future feature approach to analysis, only those semantic features
which appear both in the lexicon and in the rules will be used. That is,
only those semantic features which have grammatical consequences will be
entered in the lexicon. In this conceived grammar of Chinese, the grammar
rules will incorporate semantic features as part of their well-formedness
conditions. Those conditions on the rules will check the directly dominated
nonterminals or terminals for compatibility. Only if no conflict arises will
the constituent under consideration be accepted as being well-formed.
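
The compatibility check described above can be sketched in a few lines of
Python. This is only an illustration, not the project's implementation: it
assumes that a feature bundle is a set of signed feature names, that
"contain" and "nonconflicts" behave as their names suggest, and that the
unindexed OBJ in the double-object rule refers to the direct object OBJ2.
All words and feature assignments in the example are hypothetical.

```python
# Illustrative sketch only: feature bundles are modelled as sets of signed
# feature names such as {"+human", "-intran"}. Helper names are assumptions.

def contain(bundle, feature):
    """<where X contain f>: the bundle carries exactly this signed feature."""
    return feature in bundle

def nonconflicts(a, b):
    """<where X nonconflicts Y>: no feature is '+' in one bundle, '-' in the other."""
    flip = {"+": "-", "-": "+"}
    return not any(flip[f[0]] + f[1:] in b for f in a)

def ditransitive_ok(subj, obj1, obj2, vf, virsubj, virobj):
    """Check the four conditions attached to the double-object production."""
    return (contain(vf, "-intran")
            and contain(obj1, "+human")
            and nonconflicts(subj, virsubj)
            # Assumption: the unindexed OBJ in the rule is the direct object.
            and nonconflicts(obj2, virobj))

# Hypothetical verb of giving: animate subject, inanimate direct object.
vf, virsubj, virobj = {"-intran"}, {"+animate"}, {"-animate"}
teacher = {"+animate", "+human"}
student = {"+animate", "+human"}
book = {"-animate"}
stone = {"-animate"}

print(ditransitive_ok(teacher, student, book, vf, virsubj, virobj))  # True
print(ditransitive_ok(stone, student, book, vf, virsubj, virobj))    # False
```

The second call fails because the subject is marked -animate while the
verb's virtual subject is marked +animate, so the constituent is rejected.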


2.3 Building up the Semantic Feature Set

The  first  step  to  build  up  the  semantic  features  for  the  future  grammar 
rules  and  lexicon  is  to  select  and  extract  those  relevant  features  contained 
in  the  existing  grammar  codes.  Since  those  features  were  well-motivated  to 
capture  general  co-occurrence  restrictions  in  Chinese,  they  can  be  taken  over 
without  too  much  modification.  The  second  step  is  to  enter  the  extracted 
features  in  the  respective  lexical  entries. 

The existing features will be insufficient for our purpose in the
future. As research goes on, we expect more features to be invoked to make
our grammar more adequate. For example, a new set of semantic features will
be needed if co-occurrence restrictions such as those between classifiers and
nouns and in noun compounding are to be more adequately formulated than they
are now. For ideas of semantic features that may be needed in the future, the
feature set proposed in the 1974 Final Technical Report, p. 38, could be
consulted. Other lists, such as those given in the artificial intelligence
literature by Wilks (1972, 1973a) and Schank (1973, 1975c), may also be
helpful.

The co-occurrence restrictions contained in the existing grammar codes
can be extracted and restated in the new rules of our grammar. Some of the
co-occurrence restrictions will have to be restated as our understanding of
them deepens. For example, the co-occurrence restrictions between the verb
and the subject and/or the object as indicated by the syntactic
subcategorisation in the current grammar have to be revised once they are
better understood in terms of case relationships. Instead of a single
co-occurrence restriction between the verb and subject and/or object, we may
have to allow alternatives in a preferential scale in many cases.

In  the  existing  grammar  codes,  the  semantic  features  could  only  assume 
the  positive  value  owing  to  the  nature  of  the  rules  themselves.  In  the 
future  grammar,  each  semantic  feature  should  be  allowed  to  have  either  the 
positive,  negative  or  unmarked  value.  The  unmarked  value  is  intended  to  be 
used  in  those  lexical  entries  where  the  dichotomous  contrast  with  respect  to 
a certain  semantic  dimension  is  neutralized.  By  allowing  the  negative  and 
unmarked  values  for  the  semantic  features  the  rules  of  the  grammar  and  the 
lexicon  will  be  greatly  simplified. 
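
A minimal sketch of such three-valued features is given below. It rests on
assumptions not stated in this report: the names Value, values_clash, and
compatible are invented, features absent from an entry are treated as
unmarked, and the three nouns are hypothetical entries chosen only to show
how the unmarked value behaves.

```python
from enum import Enum

class Value(Enum):
    PLUS = "+"
    MINUS = "-"
    UNMARKED = "0"   # the contrast is neutralized for this entry

def values_clash(a, b):
    """Two values clash only when both are marked and they differ."""
    return Value.UNMARKED not in (a, b) and a != b

def compatible(entry, requirement):
    """Features absent from an entry are treated as unmarked (an assumption)."""
    return not any(
        values_clash(entry.get(name, Value.UNMARKED), wanted)
        for name, wanted in requirement.items()
    )

# Hypothetical entries: one noun marked [+animate], one marked [-animate],
# and one for which the animate/inanimate contrast is neutralized.
animal = {"animate": Value.PLUS}
rock = {"animate": Value.MINUS}
thing = {"animate": Value.UNMARKED}

needs_animate = {"animate": Value.PLUS}
print([compatible(n, needs_animate) for n in (animal, rock, thing)])
# [True, False, True]
```

With only positive values available, the third noun would presumably need
either two lexical entries or a disjunctive rule; the unmarked value lets a
single entry satisfy both requirements.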

2.4  The  Other  Types  of  Lexical  Features 

The distinction between the semantic features discussed above and the
syntactic or morphological features mentioned in section 2 can at times be
very fuzzy, simply because the line drawn between syntax and semantics is not
always clear. For our purpose, there is no need to make the distinction
between these two types of features. The contextual features as defined above
refer to the syntactic environments in which lexical items of a certain
grammatical category or constituents can be predicted in terms of other
categories or constituents when they concatenate. Both the syntactic and
contextual features are in part present in the grammar codes of the current
grammar and can be selected, extracted, and entered into the respective
lexical entries. Whenever necessary, new features of these two types can be
incorporated.

The last feature type, rule features, is intended either to trigger
interlingual transfer rules or to handle exceptions to them. The use of
rule features will greatly simplify the interlingual transfer rules and only
slightly increase the complexity of the lexicon. If the exceptions to the
interlingual transfer rules were not registered in the lexicon, they would
have to be handled either by writing less general rules or by further
subcategorising. The rule features as defined here and in section 2 can be
discovered only after the interlingual transfer rules have been formulated.
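
The two uses of rule features, triggering a rule and registering an
exception to it, can be illustrated as follows. The sketch is hypothetical
throughout: the function rule_applies, the rule names OBJ_FRONTING and
INSERT_PREP, and the three lexical items are invented, since the actual rule
features can be fixed only once the transfer rules exist.

```python
def rule_applies(rule_name, rule_features, needs_trigger=False):
    """Decide from an item's rule features whether a transfer rule may fire.

    rule_features maps rule names to '+' (triggers the rule) or '-' (exception).
    """
    mark = rule_features.get(rule_name)
    if mark == "-":
        return False            # lexical exception: the rule is blocked
    if needs_trigger:
        return mark == "+"      # minor rule: fires only for marked items
    return True                 # general rule: fires unless blocked

# Hypothetical lexical items and rule names, for illustration only.
regular_verb = {}
exceptional_verb = {"OBJ_FRONTING": "-"}
triggering_verb = {"INSERT_PREP": "+"}

print(rule_applies("OBJ_FRONTING", regular_verb))                        # True
print(rule_applies("OBJ_FRONTING", exceptional_verb))                    # False
print(rule_applies("INSERT_PREP", regular_verb, needs_trigger=True))     # False
print(rule_applies("INSERT_PREP", triggering_verb, needs_trigger=True))  # True
```

The general rule stays fully general, and the cost of each exception is one
extra mark on one lexical entry.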


2.5  Types  of  Lexical  Information 

The incorporation of the lexical features into the lexicon will
necessitate a change of the existing dictionary format. The existing format
used for the SAS allows only the information of grammar code, telecode,
English gloss, and romanization. In the future format, at least the
following types of information, whenever applicable, should be included for
each lexical entry in the dictionary: telecode, syntactic category, syntactic
and semantic features, contextual features, rule features, English gloss, and
lexical disambiguating heuristics. As we will see below (section 3.3), the
English gloss should be based on extensive contrastive lexical studies. The
lexical disambiguating heuristics will be based on the immediate contextual
information. Whenever low-level ambiguities arise, this contextual
information will be consulted first to resolve the ambiguities.
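
One possible shape for such an entry is sketched below. The field names, the
representation of a heuristic as a pair (category of the following word,
gloss), and the sample entry with its placeholder telecode are all
assumptions made for illustration; the report does not fix a record layout.

```python
from dataclasses import dataclass, field

@dataclass
class LexicalEntry:
    telecode: str
    category: str
    features: set = field(default_factory=set)      # syntactic and semantic
    contextual_features: set = field(default_factory=set)
    rule_features: dict = field(default_factory=dict)
    gloss: str = ""
    # Disambiguating heuristics: (category of the following word, gloss) pairs.
    heuristics: list = field(default_factory=list)

def choose_gloss(entry, next_category):
    """Consult the immediate context first; fall back to the default gloss."""
    for cue, gloss in entry.heuristics:
        if cue == next_category:
            return gloss
    return entry.gloss

# Placeholder entry; the telecode and glosses are invented for illustration.
entry = LexicalEntry(
    telecode="0000",
    category="V",
    features={"-intran"},
    contextual_features={"+ ___ NP"},
    gloss="use",
    heuristics=[("V", "by means of")],
)

print(choose_gloss(entry, "N"))  # use
print(choose_gloss(entry, "V"))  # by means of
```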


3.  Interlingual  Transfer  Rules 

The interlingual transfer was originally conceived as an independent
phase of the translation cycle (see Wang, et al. 1971). Under this conception
the output of the analysis phase becomes the input to the interlingual
transfer component. In practice, however, the interlingual transfer rules
were incorporated into the Chinese grammar itself. Attempts to separate them
from the analysis rules were made but have not yet been fully implemented
with appropriate machine programs. Although the analysis of Chinese is not
an end in itself in machine translation, the mixture of these two phases
may always confuse the issue and complicate the tasks at each phase. An
independent component of interlingual transfer rules is both conceptually
sound and linguistically practical. The separation should be carried out as
soon as possible in the next phase of research.


3.1  Interlingual  Component  in  Artificial  Intelligence  Approaches  to  Machine 
Translation 

Wilks (1973a, 1973b) described an English-French machine translation
system at Stanford University, which follows an "artificial intelligence"
approach. Briefly speaking, in this approach English sentences of a
paragraph are first converted into semantic or conceptual representations,
bypassing the syntactic analysis stage. Corresponding French sentences are
then generated from those semantic or conceptual representations. Schank
(1975b) also gave a brief account of an artificial intelligence approach to
machine translation. The semantic or conceptual representation is
represented by some kind of semantic network where the meaning(s) of a
sentence is represented by primitive conceptual units and their relations
(see Woods, 1975). It is supposed to be a universal interlingua. Any natural
language can be decoded into it and encoded into another language.

One of the motivations behind the semantically-based approach to
machine translation is "... that the space of meaningful expressions of a
natural language cannot be determined or decided by any set of rules
whatever -- in the way that almost all linguistic theories implicitly assume
CAN be done." (Wilks, 1973a) According to Wilks, any string of words can be
made meaningful by the use of explanations and definitions. But under
current linguistic theories, those meaningful expressions may be excluded as
unacceptable. Wilks' observation, though very true, should only be taken
with some caution at this stage of machine translation research. Nobody
has ever come up with any grammar for any natural language that is capable
of describing all the well-formed sentences in that language under any
linguistic theory, not to mention a grammar for an interlingua.

It is doubtful that the artificial intelligence approach to machine
translation described above really reflects the human translation process.
Besides, there are problems with semantic network representation of meanings
for natural language. Woods (1975) gave a critical review of the semantic
network representation of meanings for natural language. The problems seem
to center around the issue of how to capture all the relevant meanings
embodied in the very rich syntactic structure of a natural language in a
semantic network representation. On the basis of the current
state-of-the-art, our syntactically-based approach to machine translation
should be maintained in the next phase of research.

3.2  Nature,  Functions,  and  Types  of  and  Formalism  for  the  Interlingual 
Transfer  Rules 

In our conception of machine translation, it has been assumed that the
English glosses in the dictionary will give us the necessary meaning elements
in English sentences. Syntactic rearrangements of those elements in
accordance with the English syntax and morphological adjustments to those
English glosses in their base forms will give us the correct English output,
semantically, syntactically, and morphologically. We will accept this
conception as generally correct, except for some of the issues raised below.

Under the above conception, the output of the analysis phase, Chinese
trees, will undergo syntactical and morphological adjustments, which are
rules of the interlingual transfer component, to arrive at corresponding
English sentences. Morphological adjustments are always idiosyncratic
(i.e., lexical) in nature. In the following discussion, the types of
interlingual transfer rules refer only to syntactic adjustments.

All the interlingual transfer rules contained in the current Chinese
grammar will be sorted out and stated in terms of the three basic tree
operations: deletion, substitution, and adjunction. We will follow the
formalism presented in Friedman (1971) and Morin (1973) for representing
these tree operations. So far only a small portion of the existing rules
has been recast into this formalism.

Each interlingual transfer rule will take the form of a transformational
rule. It will consist of a structural description and a structural
change. The structural description will be based on the principle of
"analyzability" or "proper analysis."
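
As a rough illustration of a rule with a structural description and a
structural change, the sketch below applies the three tree operations to the
daughters of a node. It does not reproduce the formalism of Friedman (1971)
or Morin (1973): the Node class, the simplified test for analyzability, and
the rule transfer_relative are assumptions. The rule moves a Chinese
relative clause, which precedes its head noun and the marker "de", to the
position after the head noun, as in English.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def analyzable(node, description):
    """Simplified 'proper analysis': the daughters match the labels in order."""
    return [child.label for child in node.children] == description

# The three basic tree operations, acting on the daughters of a node.
def delete(node, i):
    del node.children[i]

def substitute(node, i, new):
    node.children[i] = new

def adjoin(node, i, new):
    node.children.insert(i, new)

def transfer_relative(np):
    """Hypothetical rule: Chinese [rel de np1] becomes English [np1 rel]."""
    if not analyzable(np, ["rel", "de", "np1"]):   # structural description
        return False
    delete(np, 1)                                   # drop the marker "de"
    rel = np.children[0]
    delete(np, 0)                                   # detach the clause
    adjoin(np, 1, rel)                              # re-attach after the head
    return True

np = Node("np2", [Node("rel"), Node("de"), Node("np1")])
print(transfer_relative(np), [c.label for c in np.children])
# True ['np1', 'rel']
```

Substitution is defined for completeness but is not needed by this
particular rule; a tree that does not satisfy the structural description is
left unchanged.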

3.3 Contrastive Lexical and Syntactical Studies

In the above discussion it was assumed that correct English glosses
plus appropriate syntactic adjustments to a correctly-analyzed Chinese tree
will produce a correct corresponding English sentence. How can we arrive
at the correct glosses and appropriate syntactic adjustments? The solutions
seem to hinge on extensive contrastive lexical and syntactic studies
between Chinese and English.


3.3.1  Contrastive  Lexical  Studies 

In order to arrive at the correct English glosses for their Chinese
counterparts, it is necessary to compare and contrast them in terms of their
participation in a scene, or a schema, or a frame in their respective
languages. Linguistically speaking, a frame refers to either a situational
context or a lexical network which a word invokes. On the basis of its
role(s) in a frame, the correct interpretation of a word in the source
language can be rendered closest to its counterpart in the target language.
This approach to contrastive lexical studies is partially in accord with
the principle of structural semantics which states that the meaning of a
word in a language is determined by its paradigmatic relationships to other
lexical items in the same paradigm.

In addition to the paradigmatic relations, the syntagmatic lexical
relations between any two words in both languages have to be taken into
account. It is well known to linguists in the field of contrastive
lexicography that words of the same or similar meanings in two different
languages may not have the same or similar syntactic behavior. In many
cases, some Chinese lexical items, especially the "empty" words, cannot be
given appropriate glosses unless contrastive studies of their syntagmatic
behavior in both languages are made. Steps in this direction have been
taken in the past. More work needs to be done in the future to update
the whole dictionary so that better English translations can be produced.

3.3.2 Contrastive Syntactic Studies

In the past, the linguists of our project had to carry out original
contrastive syntactic studies between Chinese and English because of a lack
of research by others in this area. Many interlingual transfer rules based
on those studies were written. As research goes on, some old rules will
have to be revised and new ones added.

In order to accommodate the earlier machine-implemented SAS, the
interlingual transfer rules were embodied in the Chinese grammar. Whenever
a certain syntactic adjustment was needed for outputting correct English, it
was introduced in all the rules that needed it. In the future, when the
interlingual rule component is separated from the grammar component, rule
schemata can be used to capture the generalisation of those structural
changes.

The strategy to be followed to uncover the syntactic correspondences
between Chinese and English will be to systematically compare and contrast
the sentential and phrasal structures according to their types. Files of
the Chinese sentence and phrase types will first be built up. Representative
token sentences and phrases from each type will then be translated
into the corresponding English sentences and phrases, with as few syntactic
adjustments as possible. At the same time, the same syntactic adjustments
will be attempted for sentences or phrases of the same type unless fidelity
is violated. By doing so, it is hoped that rules of greater generalization
can be uncovered. The Chinese sentences and phrases will later be compared
and contrasted with their English counterparts. According to the scope of
the systematic syntactic correspondences uncovered, interlingual transfer
rules of varying generalisation will be formulated. Sporadic exceptions to
the rules will be entered as rule features in the relevant lexical entries so
that other adjustments can be attempted.

The results of the extensive contrastive lexical and syntactic studies
between Chinese and English outlined above will greatly enhance our
linguistic data base. Those results will also be of great relevance to both
teaching English or Chinese as a foreign language and Chinese-English
bilingual dictionaries.

3.4 Contextual Analysis

In our syntax-based approach to machine translation, units of linguistic
analysis and translation are sentences or sub-sentences. This decision
was based on practical considerations. However, there are many cases where
the syntactic and semantic information within those parse units alone is not
enough to resolve ambiguities in them. Information from the surrounding
contexts, linguistic and/or situational, is necessary to do the work.

Linguistic contexts are provided by visible surrounding linguistic
units, and the cues for resolving ambiguities in one sentence or sub-sentence
are those lexical items or syntactic features in those units. The lexical
and syntactical cues do not have to be in the immediately preceding or
following sentence. Situational context, or more generally, knowledge of
the world provides the richer and more important disambiguating information
of the two. It is more reliable than linguistic context but more elusive.
It is probably by far the most important criterion for selecting the correct
or preferred interpretation among the many possible ones from the point of
view of human language processing. However, world knowledge is enormous and
cannot feasibly be incorporated in any natural language processing system
with a large world domain. Even if we want to be more selective, we are
always hampered by our inability to predict which part of our world
knowledge will be useful or necessary in the system.

Although in a machine translation system, unlike a question-answering
system, the world knowledge can be filled in by the reader of the
translation, it is nonetheless desirable to resolve as many ambiguities as
possible during parsing and to fill in as many linguistic gaps as possible
during interlingual transfer for the reader of the translation. In the
following, a few areas of the Chinese grammar will be used to exemplify the
need for contextual information.

3.4.1  Elided  Subjects 

In Chinese writing in general, and scientific writing in particular,
many sentences or clauses have their subjects elided. They are omitted
because they are "understood" or "recoverable" from the contexts. In
English, recoverable subjects are limited to a few well-defined linguistic
contexts, but in Chinese the elided subjects cannot be determined in purely
linguistic terms. Extra-linguistic contexts are also involved. In order to
output "readable" English, rules or heuristics to fill in the elided subjects
have to be uncovered. Their discovery depends on a thorough contextual
analysis of a huge corpus of Chinese texts.

Our preliminary investigation indicates that the author(s) of an
article or textbook is the most frequently elided subject. In many other
cases, the elided subject is the indefinite third person pronoun. Chinese
sentences with elided subjects in the above two cases can always be translated
into corresponding English subjectless passives. However, extra-linguistic
considerations for contextual coherency may override these general stylistic
conventions. It is those considerations that cause the trouble. Unless we
come up with some principles of extra-linguistic contextual coherency other
than the general conventions of stylistics in Chinese writing, we may
recover the wrong subjects. Sometimes, linguistic well-formedness conditions
may rule out the possibility of certain nouns in the surrounding sentences
being the elided subject, but they cannot determine which noun is the one.
On other occasions, linguistic cohesive devices, such as connectives, may
provide cues for recovering the elided subject. Much research in the area
of the principles of extra-linguistic contextual coherency needs to be done
in the future to solve the problem of elided subjects in Chinese.
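
The interplay described above, between well-formedness conditions that
narrow the candidates and the fallback to a subjectless passive, can be
sketched as follows. This is a toy illustration under strong assumptions:
the functions recover_subject and render are invented, a feature missing
from a noun is treated as compatible, the English material stands in for an
already analysed Chinese clause, and number agreement is ignored.

```python
def recover_subject(candidates, verb_requirement):
    """Return the single candidate noun that survives the feature check, else None.

    candidates: list of (noun, features) pairs from the surrounding sentences.
    A feature missing from a noun is treated as compatible (an assumption).
    """
    survivors = [
        noun for noun, features in candidates
        if all(features.get(k, v) == v for k, v in verb_requirement.items())
    ]
    return survivors[0] if len(survivors) == 1 else None

def render(subject, past, participle, obj):
    """Active clause if a subject was recovered, otherwise a subjectless passive."""
    if subject is None:
        return f"{obj} was {participle}"
    return f"{subject} {past} {obj}"

# Invented English material standing in for an analysed Chinese clause.
context = [("the sample", {"animate": "-"}), ("the furnace", {"animate": "-"})]
subject = recover_subject(context, {"animate": "+"})
print(render(subject, "heated", "heated", "the sample"))
# the sample was heated
```

When the feature check leaves exactly one candidate, the active clause is
produced instead; when it leaves several, the sketch declines to choose,
which mirrors the observation that well-formedness conditions alone cannot
determine which noun is the elided subject.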

3.4.2  Number,  Tense,  and  Aspect 

In Chinese the number of a noun and the tense and aspect of a verb are
not morphologically marked. The number in many cases surfaces as quantifiers
or determiners; and tense and aspect are often indicated by time nouns,
adverbs, or particles, or any combination of them. In some cases, however,
none of the overt markers exist. Since these syntactic features are
obligatory in English, they have to be inferred from the Chinese contexts
whenever the overt markers are absent. As in the case of elided subjects, we
can sometimes rely on such cohesive devices as connectives to provide the
necessary information for the English reader. Information such as the
organisation of events along the temporal axis and its cues in a text may
also be helpful in assigning the correct tense and aspect to the unmarked
verbs. Information of this type can be gathered only through contextual
analysis.
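
A very small sketch of this inference is given below. The marker table, the
function infer_tense, and the policy of inheriting the tense of the
preceding clause are assumptions made for illustration; the tokens are in
toneless romanization, and mapping the perfective particle "le" to the past
tense is a deliberate oversimplification.

```python
# Hypothetical marker table (toneless pinyin); real coverage would come from
# contextual analysis of a corpus, not from this short list.
MARKERS = {
    "zuotian": "past",          # time noun "yesterday"
    "mingtian": "future",       # time noun "tomorrow"
    "jiang": "future",          # adverb "will"
    "zhengzai": "progressive",  # adverb "in the process of"
    "le": "past",               # perfective particle, crudely mapped to past
}

def infer_tense(tokens, previous=None):
    """Use the first overt marker; otherwise inherit from the previous clause."""
    for token in tokens:
        if token in MARKERS:
            return MARKERS[token]
    return previous             # may still be None: context gave no cue

clauses = [
    ["zuotian", "women", "ceshi", "yangpin"],   # overt time noun
    ["wendu", "shenggao"],                      # no overt marker
]
tense = None
result = []
for clause in clauses:
    tense = infer_tense(clause, previous=tense)
    result.append(tense)
print(result)  # ['past', 'past']
```

The second clause carries no marker of its own and simply inherits the value
established by the time noun in the first; a clause with no cue anywhere in
the preceding text is left undecided rather than guessed.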

3.4.3  Definite  vs.  Indefinite  Reference 

Definite vs. indefinite reference to nouns is an obligatory feature in
the English grammar. In Chinese it is a derived feature and is not
morphologically marked in all cases. When this opposition is not marked for
a particular noun in a Chinese clause or sentence, only the context can
provide information for the reader to make a decision. Generally speaking,
the indefinite reference is often expressed by a preceding indefinite
quantification expression or by default of any preceding definite expression
while the definite reference is always expressed by repetition of a preceding
noun, an anaphoric expression, or a preceding demonstrative. Other overt
linguistic cues for the definite vs. indefinite reference to Chinese nouns
need to be further investigated. In cases where no overt cues are available,
extra-linguistic contextual analysis is necessary to provide the information
for this opposition in English. The semantic notions of new vs. old or
shared information, and of generic vs. specific or unique, may be helpful
in the extra-linguistic contextual analysis.
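
The overt cues listed above translate fairly directly into a decision
procedure, sketched here under stated assumptions: the cue labels, the
functions reference_type and annotate, and the sample nouns are invented,
and the extra-linguistic cases with no overt cue are simply given the
default value.

```python
def reference_type(noun, cues, mentioned):
    """Classify a noun as 'definite' or 'indefinite' from overt cues.

    cues: set drawn from {"demonstrative", "anaphoric", "indefinite_quantifier"}
    mentioned: nouns already seen earlier in the text.
    """
    if cues & {"demonstrative", "anaphoric"}:
        return "definite"
    if "indefinite_quantifier" in cues:
        return "indefinite"
    if noun in mentioned:
        return "definite"       # repetition of a preceding noun
    return "indefinite"         # default: no preceding definite expression

def annotate(nouns):
    """nouns: list of (noun, cues) in text order; returns the reference types."""
    mentioned, out = set(), []
    for noun, cues in nouns:
        out.append(reference_type(noun, cues, mentioned))
        mentioned.add(noun)
    return out

text = [("reactor", {"indefinite_quantifier"}), ("pump", set()),
        ("reactor", set()), ("valve", {"demonstrative"})]
print(annotate(text))
# ['indefinite', 'indefinite', 'definite', 'definite']
```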

We have briefly discussed the three areas of the Chinese grammar where
some kind of contextual information has to be gathered and passed from
Chinese sentences into the corresponding English sentences to produce
"readable" translations. There are three major problems related to contextual
analysis that have to be tackled in the near future. They are: (1) to
determine the relevant contextual information, (2) to gather this
information, and (3) to implement the information in the system. As the
fields of linguistics, artificial intelligence, computational linguistics,
and others advance, it is hoped that solutions to these problems may soon
emerge.